The present invention is related to the field of computer graphics, and in particular to a volume rendering pipeline.
Volume graphics is the subfield of computer graphics that deals with the visualization of objects or phenomena represented as sampled data in three or more dimensions. These samples are called volume elements, or "voxels," and contain digital information representing physical characteristics of the objects or phenomena being studied. For example, voxel values for a particular object or system may represent density, type of material, temperature, velocity, or some other property at discrete points in space throughout the interior and in the vicinity of that object or system.
Volume rendering is the part of volume graphics concerned with the projection of volume data as two-dimensional images for purposes of printing, display on computer terminals, and other forms of visualization. By assigning colors and transparency to particular voxel data values, different views of the exterior and interior of an object or system can be displayed. For example, a surgeon needing to examine the ligaments, tendons, and bones of a human knee in preparation for surgery can utilize a tomographic scan of the knee and cause voxel data values corresponding to blood, skin, and muscle to appear to be completely transparent. The resulting image then reveals the condition of the ligaments, tendons, bones, etc. which are hidden from view prior to surgery, thereby allowing for better surgical planning, shorter surgical operations, less surgical exploration and faster recoveries. In another example, a mechanic using a tomographic scan of a turbine blade or welded joint in a jet engine can cause voxel data values representing solid metal to appear to be transparent while causing those representing air to be opaque. This allows the viewing of internal flaws in the metal that would otherwise be hidden from the human eye.
Real-time volume rendering is the projection and display of volume data as a series of images in rapid succession, typically at 30 frames per second or faster. This makes it possible to create the appearance of moving pictures of the object, phenomenon, or system of interest. It also enables a human operator to interactively control the parameters of the projection and to manipulate the image, thus providing the user with immediate visual feedback. It will be appreciated that projecting tens of millions or hundreds of millions of voxel values to an image requires enormous amounts of computing power. Doing so in real time requires substantially more computational power.
Additional general background on volume rendering is presented in a book entitled "Introduction to Volume Rendering" by Barthold Lichtenbelt, Randy Crane, and Shaz Naqvi, published in 1998 by Prentice Hall PTR of Upper Saddle River, New Jersey. Further background on volume rendering architectures is found in a paper entitled "Towards a Scalable Architecture for Real-time Volume Rendering" presented by H. Pfister, A. Kaufman, and T. Wessels at the 10th Eurographics Workshop on Graphics Hardware at Maastricht, The Netherlands, on Aug. 28 and 29, 1995. This paper describes an architecture now known as "Cube-4." Cube-4 is also described in a Doctoral Dissertation entitled "Architectures for Real-Time Volume Rendering" submitted by Hanspeter Pfister to the Department of Computer Science at the State University of New York at Stony Brook in December 1996, and in U.S. Pat. No. 5,594,842, "Apparatus and Method for Real-time Volume Visualization."
Cube-4 and other architectures achieve real-time volume rendering using the technique of parallel processing. A plurality of processing elements are deployed to concurrently perform volume rendering operations on different portions of a volume data set, so that the overall time required to render the volume is reduced in substantial proportion to the number of processing elements. In addition to requiring a plurality of processing elements, parallel processing of volume data requires a high-speed interface between the processing elements and a memory storing the volume data, so that the voxels can be retrieved from the memory and supplied to the processing elements at a sufficiently high data rate to enable the real-time rendering to be achieved.
Volume rendering as performed by Cube-4 is an example of a technique known as "ray-casting." A large number of rays are passed through a volume in parallel and processed by evaluating the volume data a slice at a time, where a "slice" is a planar set of voxels parallel to a face of the volume data set. Using a fast slice-processing technique in specialized hardware, as opposed to software, frame processing rates can be increased to be higher than two frames per second.
The essence of the Cube-4 system is that the three dimensional sampled data representing the object is distributed across the memory modules by a technique called "skewing," so that adjacent voxels in each dimension are stored in adjacent memory modules independent of view direction. Each memory module is dedicated to its own processing pipeline. Moreover, voxels are organized in the memory modules so that if there are a total of P pipelines and P memory modules, then P adjacent voxels can be fetched in parallel within a single clock cycle of a computer memory system, independent of the view direction. This reduces the total time to fetch voxels from memory by a factor of P. For example, if the data set has 256^3 voxels and P has the value of four, then only 256^3/4 or approximately four million memory cycles are needed to fetch the data in order to render an image.
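By way of illustration only, the skewing property and the fetch-cycle arithmetic above can be sketched in a few lines of Python. The mapping used here, module = (x + y + z) mod P, is an assumption chosen because it satisfies the stated property that adjacent voxels in each dimension fall in adjacent memory modules; the function names are hypothetical and do not come from the Cube-4 description.

```python
from itertools import product


def module_of(x: int, y: int, z: int, p: int) -> int:
    """Memory module holding voxel (x, y, z) under the ASSUMED skewing rule."""
    return (x + y + z) % p


def fetch_cycles(n: int, p: int) -> int:
    """Memory cycles needed to fetch an n*n*n volume at p voxels per cycle."""
    return n ** 3 // p


def conflict_free(n: int, p: int) -> bool:
    """Check that any p adjacent voxels along x, y, or z use p distinct modules."""
    for x, y, z in product(range(n), repeat=3):
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            # Skip runs that would leave the volume.
            if max(x + dx * (p - 1), y + dy * (p - 1), z + dz * (p - 1)) >= n:
                continue
            modules = {
                module_of(x + i * dx, y + i * dy, z + i * dz, p) for i in range(p)
            }
            if len(modules) != p:
                return False
    return True


if __name__ == "__main__":
    print(conflict_free(8, 4))   # small volume, four modules
    print(fetch_cycles(256, 4))  # the 256^3, P = 4 example from the text
```

Under this assumed rule, a run of P neighbors along any axis never maps two voxels to the same module, which is what allows P voxels to be fetched in one memory cycle regardless of view direction.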
An additional characteristic of the Cube-4 system is that the computational processing required for volume rendering is organized into pipelines with specialized functions for this purpose. Each pipeline is capable of starting the processing of a new voxel in each cycle. Thus, in the first cycle, the pipeline fetches a voxel from its associated memory module and performs the first step of processing. Then in the second cycle, the pipeline performs the second step of processing of this first voxel, while at the same time fetching the second voxel and performing the first step of processing this voxel. Likewise, in the third cycle, the pipeline performs the third processing step of the first voxel, the second processing step of the second voxel, and the first processing step of the third voxel. In this manner, voxels from each memory module progress through its corresponding pipeline in lock-step fashion, one after another, until all voxels are fully processed. Thus, instead of requiring between 10 and 100 software instructions per voxel, a new voxel can be processed in every clock cycle.
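The lock-step progression described above can be illustrated with a small timing model. This is only a sketch: the function name and the 1-based numbering of cycles, stages, and voxels are assumptions made for readability, and the model ignores memory latency and stalls.

```python
def pipeline_schedule(num_voxels: int, num_stages: int) -> list[dict[int, int]]:
    """For each cycle, map stage number -> voxel number being processed (1-based)."""
    total_cycles = num_voxels + num_stages - 1
    schedule = []
    for cycle in range(1, total_cycles + 1):
        active = {}
        for stage in range(1, num_stages + 1):
            # Voxel v enters stage 1 in cycle v and advances one stage per cycle.
            voxel = cycle - stage + 1
            if 1 <= voxel <= num_voxels:
                active[stage] = voxel
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    for cycle, active in enumerate(pipeline_schedule(5, 3), start=1):
        print(f"cycle {cycle}: {active}")
```

In this model a pipeline with S stages finishes N voxels in N + S - 1 cycles, so for large N the cost approaches one cycle per voxel.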
Skewing can disperse adjacent voxels over any of the pipelines, and since the pipelines are dedicated to memory modules, each pipeline in the Cube-4 system must communicate voxel data with four other pipelines, i.e., the two neighboring pipelines on either side. Such communication is required, for example, to transmit voxel values from one pipeline to another for purposes such as estimating gradients or normal vectors so that lighting and shadow effects can be calculated. Pipeline interconnects are used to communicate the values of rays as they pass through the volume accumulating visual characteristics of the voxels in the vicinities of the areas through which they pass. Having a large number of interconnects among the pipelines increases the complexity of the system.
In the Cube-4 system, volume rendering proceeds as follows. Data are organized as a cube or other parallelepiped data structure. Considering first the face of this cube or solid that is most nearly perpendicular to the view direction, a partial beam of P voxels at the top corner is fetched from P memory modules concurrently, in one memory cycle, and inserted into the first stage of the P processing pipelines. In the second cycle these voxels are moved to the second stage of their respective pipelines. At the same time, the next P voxels are fetched from the same beam and inserted into the first stage of their pipelines. In each subsequent cycle, P more voxels are fetched from the top beam and inserted into their pipelines, while previously fetched voxels move to later stages of their pipelines. This continues until the entire beam of voxels has been processed. In the terminology of the Cube-4 system, a row of voxels is called a "beam" and a group of P voxels within a beam is called a "partial beam."
After the groups of voxels in a beam have been processed, the voxels of the next beam are processed, and so on, until all of the beams of the face of the volume data set have been fetched and inserted into their processing pipelines. This face is called a "slice." Then, the Cube-4 system moves again to the top corner, but this time starts fetching the P voxels in the top beam immediately behind the face, that is, from the second "slice." In this way, it progresses through the second slice of the data set, a beam at a time and, within each beam, P voxels at a time. After completing the second slice, it proceeds to the third slice, then to subsequent slices in a similar manner, until all slices have been processed. The purpose of this approach is to fetch and process all of the voxels in an orderly way, P voxels at a time, until the entire volume data set has been processed and an image has been rendered.
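The slice-by-slice, beam-by-beam, partial-beam-by-partial-beam order described above amounts to three nested loops. The following sketch is illustrative only; the coordinate names, the generator name, and the assumption that the beam length is a multiple of P are not taken from the Cube-4 description.

```python
from typing import Iterator


def traversal(nx: int, ny: int, nz: int, p: int) -> Iterator[tuple[int, int, int]]:
    """Yield (slice, beam, first_x) for each partial beam in fetch order.

    Assumes nx (the beam length) is a multiple of p.
    """
    if nx % p != 0:
        raise ValueError("beam length must be a multiple of p in this sketch")
    for z in range(nz):                # slices, front to back
        for y in range(ny):            # beams within a slice, top to bottom
            for x0 in range(0, nx, p):  # partial beams of p voxels
                yield z, y, x0


if __name__ == "__main__":
    for step in traversal(nx=8, ny=2, nz=2, p=4):
        print(step)
```

Each yielded tuple corresponds to one memory cycle in which P voxels are fetched, so the number of tuples equals the total voxel count divided by P.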
The processing stages of the Cube-4 system perform all of the calculations required for the ray-casting technique, including interpolation of samples, estimation of the gradients or normal vectors, assignments of colors and transparency or opacity, and calculation of lighting and shadow effects to produce the final image on the two dimensional view surface.
The Cube-4 system is designed to be capable of being implemented in semiconductor technology. However, two limiting factors prevent Cube-4 from achieving the small size and low cost necessary for personal or desktop-size computers, namely the rate of accessing voxel values from memory modules, and the amount of internal storage required in each processing pipeline. With regard to the rate of accessing memory, the method of skewing voxel data across memory modules in Cube-4 leads to inefficient patterns of accessing voxel memory that are as slow as random accesses. Therefore, in order to achieve real-time volume rendering performance, voxel memory in a practical implementation of Cube-4 must comprise either very expensive static random access memory (SRAM) modules or a very large number of independent Dynamic Random Access Memory (DRAM) modules to provide adequate access rates. With regard to the internal storage, the Cube-4 algorithm requires that each processing pipeline store intermediate results within itself during processing, the amount of storage being proportional to the area of the face of the volume data set being rendered. For a 256^3 data set, this amount turns out to be so large that the size of a single chip processing pipeline is excessive, and therefore impractical for a personal computer system.
In order to make real-time volume rendering practical for personal and desktop computers, an improvement upon the Cube-4 system referred to as "EM Cube" employs techniques including architecture modifications to permit the use of high capacity, low cost Dynamic Random Access Memory or DRAM devices for memory modules. The EM Cube system is described in U.S. patent application Ser. No. 08/905,238, filed Aug. 1, 1997, entitled "Real-Time PC Based Volume Rendering System", and is further described in a paper by R. Osborne, H. Pfister, et al. entitled "EM-Cube: An Architecture for Low-Cost Real-Time Volume Rendering," published in the Proceedings of the 1997 SIGGRAPH/Eurographics Workshop on Graphics Hardware, Los Angeles, California, on Aug. 3-4, 1997.
The EM-Cube system utilizes DRAM chips that support "burst mode" access to achieve both low cost and high access rates to voxel memory. In order to exploit the burst mode, EM Cube incorporates architectural modifications that are departures from the Cube-4 system. In a first modification, called "blocking," voxel data are grouped into blocks, independent of a view direction, so that all voxels within a block are stored at consecutive memory addresses within a single memory module. Each processing pipeline fetches an entire block of neighboring voxels in a burst rather than one voxel at a time. In this way, a single processing pipeline can access memory at data rates of 125 million or more voxels per second, thus making it possible for four processing pipelines and four DRAM modules to render 256^3 data sets at 30 frames per second.
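The data-rate figure and the blocking layout above can be checked with a short sketch. The per-pipeline rate follows directly from the numbers in the text. The address mapping, by contrast, is hypothetical: the block edge length `b`, the round-robin assignment of blocks to modules, and the function names are assumptions used only to show voxels of one block occupying consecutive addresses in a single module.

```python
def required_rate(n: int, fps: int, pipelines: int) -> float:
    """Voxels per second each pipeline must fetch to render n^3 voxels at fps."""
    return n ** 3 * fps / pipelines


def block_address(x: int, y: int, z: int, n: int, b: int, p: int) -> tuple[int, int]:
    """Return (module, address) for voxel (x, y, z) under an ASSUMED block layout.

    n is the volume edge, b the block edge (n must be a multiple of b),
    and p the number of memory modules.
    """
    blocks_per_edge = n // b
    bx, by, bz = x // b, y // b, z // b
    block_index = (bz * blocks_per_edge + by) * blocks_per_edge + bx
    module = block_index % p          # hypothetical round-robin placement
    slot = block_index // p           # position of the block within its module
    offset = ((z % b) * b + (y % b)) * b + (x % b)
    return module, slot * b ** 3 + offset


if __name__ == "__main__":
    print(required_rate(256, 30, 4))           # about 125.8 million voxels/s
    print(block_address(1, 1, 1, n=4, b=2, p=2))
```

Because every voxel of a block maps to one contiguous address range, the whole block can be read in a single burst, which is the property that lets ordinary DRAM sustain the required rate.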
In EM Cube, each block is processed in its entirety within the associated processing pipeline. EM Cube employs an inter-chip communication scheme to enable each pipeline to communicate intermediate values to neighboring pipelines as required. For example, when a pipeline in EM Cube encounters either the right, bottom or rear face of a block, it is necessary to transmit partially accumulated rays and other intermediate values to the pipeline that is responsible for processing the next block located on the other side of the respective face. Significant inter-chip communication bandwidth is required to transmit these intermediate values to any other pipeline. However, the amount of inter-chip communication is reduced by blocking.
Like Cube-4, the EM Cube architecture is designed to be scalable, so that the same basic building blocks can be used to build systems with significantly different cost and performance characteristics. In particular, the above-described block processing technique and inter-chip communication structure of EM Cube are designed such that systems using different numbers of chips and processing pipelines can be implemented. Thus, block-oriented processing and high-bandwidth inter-chip communication help EM Cube to achieve its goals of real-time performance and scalability. It will be appreciated, however, that these features also have attendant costs, notably the cost of providing area within each processing pipeline for block storage buffers and also the costs of chip I/O pins and circuit board area needed to effect the inter-chip communication.
In a second modification to the Cube-4 architecture, EM Cube also employs a technique called "sectioning" in conjunction with blocking in order to reduce the amount of on-chip buffer storage required for rendering. In this technique, the volume data set is subdivided into sections and rendered a section at a time. Partially accumulated rays and other intermediate values are stored in off-chip memory across section boundaries. Because each section presents a face with a smaller area to the rendering pipeline, less internal storage is required. The effect of that technique is to reduce the amount of intermediate storage in a processing pipeline to an acceptable level for semiconductor implementation.
Sectioning in EM Cube is an extension of the basic block-oriented processing scheme and is supported by some of the same circuitry required for the communication of intermediate values necessitated by the block processing architecture. However, sectioning in EM Cube results in very bursty demands upon off-chip memory modules in which partially accumulated rays and other intermediate values are stored. That is, intermediate data are read and written at very high data rates when voxels near a section boundary are being processed, while at other times no intermediate data are being read from or written to the off-chip memory. In EM Cube it is sensible to minimize the amount of intermediate data stored in these off-chip memory modules in order to minimize the peak data rate to and from the off-chip memory when processing near a section boundary. Thus in EM Cube many of the required intermediate values are re-generated within the processing pipelines rather than being stored in and retrieved from the off-chip memory modules. During the processing carried out in each section near the boundary with the preceding section, voxels from the preceding section are re-read and partially processed in order to re-establish the intermediate values in the processing pipeline that are required for calculation in the new section.
While the EM Cube system achieves greater cost effectiveness than the prior Cube-4 system, it would be desirable to further lower costs to enable more widespread enjoyment of the benefits of volume rendering. Further, it would be desirable to achieve such cost reductions while retaining real-time performance levels. It would also be desirable to achieve rendering performance of 256^3 voxels at 24 frames per second, or better, with a single integrated semiconductor chip.
A single integrated circuit (IC) includes a plurality of 3D graphic rendering pipelines. The integrated circuit can be mounted on a circuit board along with memories and other interface logic, such as buses. The circuit board, while plugged into a PC or workstation, i.e., a host computer, enables the host to perform real-time 3D graphic rendering of a volume data set stored in the memories.
The pipelines of the IC operate in parallel on an array of voxels of the volume data set. Each pipeline receives one voxel during each processing cycle of the pipeline. Each pipeline is identical to the other pipelines, and each pipeline includes a number of different processing stages connected serially within each pipeline.
Each pipeline includes FIFO buffers to store data related to voxels that are spatially adjacent in the array, but that are processed during temporally different processing cycles. The FIFO buffers are connected in parallel to the pipelines. The FIFO buffers operate as delay lines so that the results of previously processed voxels can be combined with results of later processed voxels.
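The delay-line behavior of such a FIFO buffer can be modeled in software as follows. This is a sketch under stated assumptions: the function name, the use of a fixed delay measured in processing cycles, and the idea of pairing values are illustrative and are not taken from the circuit described here.

```python
from collections import deque
from typing import Iterable


def delayed_pairs(samples: Iterable[int], delay: int) -> list[tuple[int, int]]:
    """Pair each sample with the sample that arrived `delay` cycles earlier."""
    if delay < 1:
        raise ValueError("delay must be at least one cycle")
    fifo: deque[int] = deque()
    pairs = []
    for current in samples:
        if len(fifo) == delay:
            # The oldest entry has now been held for exactly `delay` cycles.
            earlier = fifo.popleft()
            pairs.append((earlier, current))
        fifo.append(current)
    return pairs


if __name__ == "__main__":
    # With a delay equal to an assumed beam length of 4, each voxel meets
    # the voxel at the same position in the previous beam.
    print(delayed_pairs(range(12), delay=4))
```

Choosing the delay equal to the number of cycles between two spatially adjacent voxels brings their results together in the same cycle, which is the purpose the FIFO buffers serve.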
The stages of the pipelines include interfaces to transmit rendering data to only one neighboring identical pipeline stage, and to receive rendering data from only one other neighboring identical pipeline stage. Effectively, the interfaces connect identical stages of different pipelines in a one-way communications ring.
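The one-way ring can be pictured as a rotation of values among identical stages. In this sketch the direction of travel (pipeline i sends to pipeline i + 1, with the last wrapping around to the first) and the function name are assumptions chosen for illustration.

```python
def ring_step(outputs: list) -> list:
    """One transfer around a one-way ring.

    outputs[i] is the value sent by pipeline i. The result holds, at index i,
    the value received by pipeline i from its single upstream neighbor.
    """
    p = len(outputs)
    return [outputs[(i - 1) % p] for i in range(p)]


if __name__ == "__main__":
    values = ["a", "b", "c", "d"]
    print(ring_step(values))
```

Each stage needs only one outgoing and one incoming link, in contrast to an arrangement in which every pipeline is wired to several neighbors on both sides.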
Each pipeline may include one or more of the following stages: an interpolation stage, a gradient estimation stage, a classification stage, an illumination stage, and a compositing stage.