Sampling of volume data on a regular grid is an essential and time consuming part of many algorithms that are used e.g. in medical imaging appliances such as NMR- or CT-scanners. Examples that fall into this class are all kind of volume rendering algorithms (e.g. VRT, MIP, MINIP, DDR, SSD, MPR) based on an orthographic projection using a ray-casting approach. Volume rendering is just one example where volume data has to be sampled massively on a regular grid, other examples are e.g. automatic registration or volume resampling.
In the case of volume rendering using an orthographic camera, such a regular grid arises when using a ray-casting approach. The basic form of ray-casting comprises the following steps: For each pixel of the final two-dimensional image, a ray of sight is shot (i.e. “cast”) through the volume. Along the part of the ray of sight that lies within the volume, usually equidistant sampling positions or samples are selected. As in general the voxel dataset describing the volume is not aligned with the ray of sight, sampling positions usually will be located in between voxels. Because of that, it is necessary to interpolate the values of the samples from its surrounding voxels. The sampling results along the ray are then used to calculate the pixel data in the planar result image.
Another very important case where such regular sampling grids appear is automatic volume registration using e.g. mutual information; here the sampling results are used to construct the mutual histogram. In this case the regular grid arises since the samples are defined by the regular voxel grid of one of the two volumes that have to be registered.
A third important case is volume resampling, where one applies an affine transformation on a given volume and needs to resample this transformed volume to form a new volume.
In the following we will mainly focus on volume rendering as example application, but we clearly want to emphasize that the presented algorithm works generally for all cases where one has to do sampling calculations on a regular grid.
The simplest kind of interpolation used when determining the value of the samples is based on Nearest Neighbour interpolation and just retrieves the value of the voxel that is most close to the sampling position. However, due to image quality reasons, sampling is usually at least done using trilinear interpolation, which means that at least 8 voxels of the volume dataset have to be touched to produce the result of a single sample. Better image quality can be achieved using higher order interpolation schemes like cubic interpolation, such higher order interpolation schemes soon require an even larger amount of voxels of the volume dataset to be taken into account per sample. Nearest Neighbour, trilinear and tricubic interpolation are just some examples of volume sampling, generally a sampling in the following can be interpreted as any operation where calculations are done using the voxel data in a local neighbourhood of a sampling position as input. For example in the case of rendering a shaded VRT such a generalized sampling operation would be the local gradient computation that has to be done at each sampling position.
Because of the size of modern voxel datasets in medical imaging, to calculate all samples on such a regular grid of sampling positions that are inside the volume dataset, an enormous amount of memory accesses have to take place. Since the computation power improvement of the hardware due to parallelization advances by increasing the number of computation cores is larger than the memory bandwidth and memory latency advancement, for algorithms that heavily do volume sampling, the volume data memory access performance in many cases becomes the limiting factor that determines the overall performance.
On most hardware, accessed memory lines are cached, so that once a memory line was loaded it remains in a memory cache for some time. The access to a cache memory is much faster than to standard memory. The memory access strategy is optimal if every memory line in RAM that contains voxel data is loaded only once into the cache for each rendered image or never. If a memory line has to be loaded multiple times into the cache when sampling all regular grid positions, this effect is called cache thrashing.
The size of the thrashing effect usually largely depends on the viewing orientation, assuming the dataset is stored in a linear storage format (slice array) as it is usually the case in medical software frameworks. This effect can be visualized by analyzing the rendering performance depending on the viewing orientation.
It is well known that the performance difference between the fastest and the slowest viewing orientation can become dramatically huge (e.g. on a computer system based on Intel Harpertown CPUs a difference between the fastest viewing orientation and the slowest viewing orientation up to a factor of 40 in some scenarios is noticeable), where the exact number depends on the used hardware (see FIGS. 30 and 35). Because of these enormous performance differences it is the top priority of any volume rendering algorithm to minimize cache thrashing effects.
It is not necessary that there is just one memory cache level available; often there exists a hierarchy of such cache levels. The lowest level cache usually is the smallest one but the fastest, the following levels increase in size but decrease on cache access performance. This complicates the cache thrashing problem, since we not just have to minimize accesses to memory that is not part of a cache, but moreover have to take the cache hierarchy into account to achieve optimal performance.
To improve the memory access performance of volume sampling there are multiple general approaches possible:    A) modify the volume sampling algorithm itself so that memory accesses are more cache friendly.
There are many volume rendering algorithms that fall into this category: One of the most popular algorithms of this category is the original Shear-Warp algorithm. In this technique, the viewing transformation is transformed such that the nearest face of the volume becomes axis aligned with an off-screen image buffer with a fixed scale of voxels to pixels. The volume is then rendered into this buffer using the far more favorable memory alignment and fixed scaling and blending factors. Once all slices of the volume have been rendered, the buffer is then warped into the desired orientation and scaled in the displayed image.
This technique is fast since it is designed to access the memory very efficiently if the dataset is stored in a linear storage format. It can be seen as a mixture of image based and object based volume rendering. The drawback of this approach is that the image quality is not as good as the one from ray-casting based algorithms and that the produced result image quality depends on the orientation angle. Many algorithms were derived from the original Shear-Warp algorithm to decrease the quality impact, so that one could speak nowadays of a family of Shear-Warp algorithms. (see e.g. P. Lacroute and M. Levoy. Fast volume rendering using a shear-warp factorization of the viewing transformation. In Proceedings of ACM SIGGRAPH, pages 451-458, 1994, the entire contents of which are hereby incorporated herein by reference.)
Other algorithms that fall into this category are those based on splatting techniques, which can be categorized as object based volume rendering techniques. Here, each voxel is projected onto the result image and its color is distributed across nearby pixels of the 2D image. As for the Shear-Warp family, also here the memory can be accessed very efficiently if it is stored in a linear format, but on the other side it does not produce the same image quality as ray-casting based approaches.
In general, these approaches usually mean that one trades image quality against performance, which is in most cases not acceptable in medical applications where high quality result images are required.    B) modify the way how the volume dataset is stored in RAM.
A popular approach to reorganize the data is known as bricking. Instead of storing the whole dataset linear in memory, the dataset is decomposed into small 3D blocks of fixed size. If the block is small enough to fit into the largest level of the cache hierarchy, sampling of this block can be done more efficiently. The idea behind this approach is then to decompose the problem to render the whole dataset into the problem to render many small blocks and to compose the intermediate results to the final image. The disadvantage of this approach is that an additional effort is created to decompose the volume rendering task and to compose the results. Also an additional effort has to be taken into account at the boundary of the blocks, so that all voxels that are required for sampling are also available at the block boundary. (see e.g. Daniel Weiskopf, Manfred Weiler, Thomas Ertl. Maintaining Constant Frame Rates in 3D Texture-Based Volume Rendering. CGI '04: Proceedings of the Computer Graphics International, pages 604-607, 2004, the entire contents of which are hereby incorporated herein by reference; K. Ma et al., A Data Distributed, Parallel Algorithm for Ray-Traced Volume Rendering. Proc. Parallel Rendering Symp., ACM, 1993, pp. 15-22, the entire contents of which are hereby incorporated herein by reference.)
Another popular approach to optimize the memory accesses by reorganizing the data is known as swizzling. Here, compared to storing the volume linear in memory, the indices of the swizzled dataset are permutated, so that voxels with close indices are closer together in 3D space than it would be the case with a linear memory arrangement. When accessing memory, each voxel index of a linear layout is transformed into its swizzled index before it is processed any further. Usually, to minimize the cost of this additional re-indexing step the permutation function is chosen so that the index computation can be optimized using lookup tables. (see e.g. Steven Parker, Peter Shirley, Yarden Livnat, Charles Hanse, Peter-Pike Sloan. Interactive ray tracing for isosurface rendering. VIS '98: Proceedings of the conference on Visualization '98, Pages: 233-238, the entire contents of which are hereby incorporated herein by reference.)
In general, instead of storing a volume dataset in a linear layout (slice-array), these approaches modify the volume data storage format so that memory accesses along rays become more coherent and less orientation dependent, leading to a faster average performance. However, a modified, non-standardized data storage format might lead to problems in data exchange and data usage in later processing. Moreover additional computational costs can arise since the indices of all voxels that are used during the sampling step have to be transformed to the storage format.    C) modify the order how the sample positions along the rays are evaluated.
By grouping rays into packets instead of handling single rays sequentially, memory coherency is improved considerably. In practice, these approaches are usually combined with B) since grouping of rays into packets alone reduces cache misses but does not yield excellent performance for all viewing orientations (see FIGS. 30 and 35-40).
The bricking approach mentioned above can also be modified so that the volume is still stored linear in memory, but the volume is nevertheless virtually subdivided into blocks. A block here defines a group of voxels where all samples that lie within are handled completely before switching to the next block. The blocks are sorted in front to back order as defined by the viewing orientation. The image is then calculated by applying an advancing ray-front algorithm onto the blocks, where voxels within the block only have to be loaded once from memory into the cache, assuming that the largest level of the cache hierarchy is large enough to hold the block. This approach minimizes cache thrashing for the largest cache hierarchy level, but not necessarily for all cache hierarchies. Additionally it has the disadvantage that a good amount of overhead is created since rays have to be clipped for each traversed brick and additional data structures have to be maintained during the ray-front traversal to store the information required when advancing the ray-front. Also it would be possible to realize this approach without having to modify the volume storage format, because of performance reasons, this approach is typically combined with the abovementioned bricking approach. (see e.g. A. Law and R. Yagel. Multi-frame thrashless ray casting with advancing ray-front, In Proceedings of Graphics Interfaces, pages 70-77, 1996, the entire contents of which are hereby incorporated herein by reference; and/or Palmer M E, Brian Totty, Stephen Taylor: “Ray Casting on Shared-Memory Architectures. Memory-Hierarchy Considerations in Volume Rendering”, IEEE Concurrency, IEEE Service Center, Piscataway, N.Y., US, vol. 6, No. 1, January 1998, pp. 20-35, XP000737883, the entire contents of which are hereby incorporated herein by reference.)
Another approach that is very popular is tile decomposition using ray packets. Here, the result image is decomposed into small tiles and the overall problem to calculate the whole result image is decomposed into the problem to calculate all small tiles. Each tile sub-image is calculated by sampling the required rays slice by slice before the sampling distance is increased to move on along the rays to the next slice. Because of the ray coherence of this strategy, many samples will access the same voxel data. It is more likely that a voxel will be found in the cache already because one of the neighbor rays could have triggered moving the voxel into the cache. Or the voxel could be still in the cache when the ray moves along since the number of rays in the tile is small, which decreases the likelihood that many memory lines have to be touched when sampling all rays at a given distance from their start position.
The big advantage of this algorithm is that it allows a very simple implementation on most hardware, e.g. on CPUs using SIMD or on programmable graphics cards using CUDA or OpenGL. The drawback of this approach compared to the abovementioned block approach is that cache thrashing still occurs noticeably when using a linear slice array volume data format, where the size of the cache thrashing effect is depending on the orientation of the viewing direction. This effect can be detected by comparing the rendering performance for various orientations. The fastest orientation will usually be the one where the viewing orientation is parallel to a voxel line that lies linear in memory. Another issue that can be detected is that when modifying the tile rectangle in its size, such a modification will increase the performance for various viewing orientations and decrease it for others (see FIGS. 30 and 35-40). To minimize the orientation dependency and further improve the performance, this approach is often combined with bricking or swizzling. (see e.g. Aaron Knoll, Charles Hansen and Ingo Wald. Coherent Multiresolution Isosurface Ray Tracing. Scientific Computing and Imaging Institute, University of Utah. Technical Report No UUSCI-2007-001, the entire contents of which are hereby incorporated herein by reference.)
Instead of packeting rays that pass the pixels of the result image as described above, which means to packet just along one direction of the grid of sampling positions, another approach is to also support traversal along the other directions of the grid. Here, one analyzes all angles of the grid directions and basically chooses from them that direction where the angle between the chosen direction and the x-direction of the volume becomes minimal, since the x-direction of a volume would be the direction that guaranties an especially good cache reuse rate for linear slice arrays. In comparison to tile decomposition this approach increases the number of viewing directions that have a good cache reuse rate and thus improves also the average rendering time. However, for grids where no grid direction is close to the x-direction of the volume, this approach can not prevent larger cache thrashing effects. This means that in order to minimize the orientation dependency and further improve the performance, also this approach would also have to be combined with bricking or swizzling. (see e.g. US 2007/0065001 A1, the entire contents of which are hereby incorporated herein by reference.)
A currently very popular approach for doing volume rendering is to exploit the rendering hardware of GPUs by using a graphics API like OpenGL or Direct3D. The volume is here stored as an array of 2D textures or as bricks of 3D textures. Also those GPU approaches can be interpreted as algorithms that are a combination of B) and C), since also here those approaches basically just differ in how the volume is stored and how the sampling positions along the rays are evaluated. The major difference here is that some parts are not directly visible to the developer but are hidden by the GPU hardware and the used graphics APIs.
In the abovementioned cases, the volume rendering problem is decomposed into partitions that naturally allow an implementation on a multiprocessor system. Here, the first approach falls into the category known as object decomposition, while the latter approaches fall into the category known as image decomposition. Further details on these known approaches, the entire contents of each of which is hereby incorporated herein by reference, can be found e.g. in:    Michael Abrash. Rasterization on Larrabee. http://software.intel.com/en-us/articles/rasterization-on-larrabee/.    Stefan Bruckner. Efficient Volume Visualization of Large Medical Datasets. Mastersthesis, Institute of Computer Graphics and Algorithms, Vienna University of Technology, 2004.    G. Knittel. The Ultravis system. In Proceedings of IEEE Symposium on Volume Visualization, pages 71-79, 2000.    Gerd Marmitt, Heiko Friedrich, and Philipp Slusallek, Interactive Volume Rendering with Ray Tracing, State-of-the-Art Report (STAR) at Eurographics 2006, Vienna, Austria, Nov. 4-8, 2006    Seiler, Larry et al., Larrabee: a many-core x86 architecture for visual computing, ACM Trans. Graph., Volume 27, Pages 1-15, 2008.    Singh, Jaswinder Pal and Gupta, Anoop and Levoy, Marc. Parallel Visualization Algorithms: Performance and Architectural Implications. Computer, pages 45-55, 1994.    Sören Grimm, Stefan Bruckner, Armin Kanitsar and Eduard Gröller. A refined data addressing and processing scheme to accelerate volume raycasting. Computers & Graphics, Volume 28, Issue 5, October 2004, Pages 719-729.