In many fields of engineering and science, such as medical image processing, visualizing large quantities of image data quickly is a major concern. Because modern voxel datasets are large, especially in medical imaging, processing all items of a volume dataset requires an enormous number of memory accesses. Hardware computation power, which grows through parallelization as the number of computation cores increases, has improved faster than memory bandwidth and memory latency. For volume rendering algorithms, the performance of volume data memory accesses therefore becomes, in many cases, the limiting factor that determines the overall performance. To achieve optimal performance, it is crucial to minimize the impact of memory accesses by maximizing cache reuse. Many known volume rendering procedures show significant disadvantages with respect to cache reuse.
Different approaches to volume rendering are known, depending on the respective computer platform. On a shared memory system using multi-core processors, a popular approach is to distribute the workload necessary to render the image by decomposing the image space. The result image is divided into many small tiles, where each tile holds a part of the full image. The tiles of the image may then be calculated on different hardware elements and/or by different threads.
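By way of illustration only, the division of a result image into tiles may be sketched as follows. The function name `decompose_into_tiles`, the tuple layout of a tile and the square tile size are assumptions made for this sketch and are not taken from any particular implementation.

```python
from typing import List, Tuple

# Hypothetical tile representation: (x0, y0, width, height) in pixels.
Tile = Tuple[int, int, int, int]


def decompose_into_tiles(image_w: int, image_h: int, tile_size: int) -> List[Tile]:
    """Split an image into tiles in scanline (row-major) order.

    Tiles at the right and bottom border are clipped to the image size.
    """
    tiles = []
    for y0 in range(0, image_h, tile_size):
        for x0 in range(0, image_w, tile_size):
            w = min(tile_size, image_w - x0)
            h = min(tile_size, image_h - y0)
            tiles.append((x0, y0, w, h))
    return tiles
```

For a 100 x 70 pixel image and a tile size of 32, this yields 4 x 3 tiles, the last of which is clipped to 4 x 6 pixels.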
On most hardware, accessed data is cached, so that once a data item has been loaded, it remains in a memory cache for some time. Access to cache memory is much faster than access to standard memory. The memory access strategy is optimal if every data item in random access memory (RAM) that contains voxel data is loaded into the cache at most once for each rendered image. Typically, the memory system comprises a hierarchy of multiple cache levels, where the last level cache (LLC) of this hierarchy is typically shared between all cores.
There are mainly two classes of algorithms for volume rendering on multiprocessor systems, which may also be combined with each other:
    object space decomposition algorithms, where the volume is decomposed into sub-blocks, and
    image space decomposition algorithms, where the result image is decomposed into tiles.
The present application is directed to the second category, namely image space decomposition algorithms, which may be combined with object space decomposition schemes.
Generally, image space decomposition algorithms are also used in the context of raytracing or rasterizing graphic primitives like meshes and splines.
In the paper “Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids”, Kutluca H. et al., 1997, three general categories of image space decomposition algorithms are described:
    1. Static decomposition,
    2. Dynamic decomposition, and
    3. Adaptive decomposition.
Static decomposition schemes decompose the image into regions that are assigned statically to the available threads. This static load balancing approach suffers from the problem that the rendering load may not be distributed evenly across the processors.
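The load imbalance of a static scheme may be illustrated with a minimal sketch. The contiguous block assignment, the function names and the per-tile cost values are assumptions chosen for illustration only.

```python
from typing import List, Sequence


def static_assignment(num_tiles: int, num_threads: int) -> List[List[int]]:
    """Assign contiguous blocks of tile indices to threads before rendering starts."""
    base, extra = divmod(num_tiles, num_threads)
    assignment = []
    start = 0
    for t in range(num_threads):
        count = base + (1 if t < extra else 0)  # spread the remainder over the first threads
        assignment.append(list(range(start, start + count)))
        start += count
    return assignment


def per_thread_load(assignment: List[List[int]], tile_cost: Sequence[int]) -> List[int]:
    """Sum the (hypothetical) rendering cost of the tiles owned by each thread."""
    return [sum(tile_cost[i] for i in tiles) for tiles in assignment]
```

With eight tiles of cost `[1, 1, 1, 1, 9, 9, 9, 9]` and two threads, each thread receives four tiles, but the resulting loads are 4 and 36, so the first thread is idle for most of the frame.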
Dynamic load balancing is based on the idea that each thread queries dynamically for a new workload as soon as it has finished its previous rendering work. Dynamic algorithms decompose the image into small tiles. Each thread receives a tile, renders it and, when done, requests the next free tile that has to be rendered. This approach guarantees that no thread is idle while unassigned tiles remain.
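A minimal sketch of such a basic dynamic scheme is given below, assuming a shared tile counter protected by a lock. The names `render_dynamic` and `render_tile` are hypothetical, and `render_tile` stands for whatever function computes the pixels of one tile.

```python
import threading
from typing import Callable, List, Sequence, Tuple


def render_dynamic(tiles: Sequence, render_tile: Callable, num_threads: int) -> Tuple[List, List]:
    """Let each thread fetch the next free tile as soon as it finishes its previous one."""
    next_index = 0                      # index of the next unassigned tile
    lock = threading.Lock()             # protects next_index
    results = [None] * len(tiles)
    handled_by = [None] * len(tiles)    # records which thread rendered which tile

    def worker(thread_id: int) -> None:
        nonlocal next_index
        while True:
            with lock:                  # query for a new workload
                if next_index >= len(tiles):
                    return              # no unassigned tile is left
                i = next_index
                next_index += 1
            results[i] = render_tile(tiles[i])   # render outside the lock
            handled_by[i] = thread_id

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results, handled_by
```

Every query passes through the same lock, so with very small tiles the threads may contend for it; this is the contention that the modification described next aims to reduce.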
An improvement to this basic dynamic algorithm is to additionally adapt the size of the workloads. Here, at the beginning, a thread queries not just one tile but multiple tiles. Later, when the number of unhandled tiles has decreased, the number of queried tiles decreases until just one tile is queried and one arrives at the basic algorithm (see S. Parker et al. 1998).
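The rule by which the number of queried tiles decreases is not specified above. The following sketch assumes, purely for illustration, that each query receives a fixed fraction of the remaining tiles, with the parameter `divisor` being a hypothetical tuning value.

```python
from typing import List


def chunk_sizes(num_tiles: int, num_threads: int, divisor: int = 2) -> List[int]:
    """Return the sequence of workload sizes handed out by successive queries.

    Assumed rule: each query receives remaining // (divisor * num_threads)
    tiles, but never fewer than one tile.
    """
    remaining = num_tiles
    sizes = []
    while remaining > 0:
        size = max(1, remaining // (divisor * num_threads))
        sizes.append(size)
        remaining -= size
    return sizes
```

For 64 tiles and 4 threads, the first query receives 8 tiles, subsequent queries receive fewer, and the final queries receive a single tile each, as in the basic algorithm.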
This modification is based on the idea of reducing thread contention at the beginning, while arriving at the basic scheme toward the end so as not to lose granularity and to still allow a fine-grained workload distribution.
Another proposed modification is to decompose the initial image into equally sized rectangular regions, where each processor is assigned to one such region. Each region then consists of smaller tiles, and each processor handles the tiles of its region in scanline order. When a processor has finished computing all its tiles, it goes to the next processor in scanline order that has not finished its workload and grabs tiles that still have to be calculated. Such an algorithm improves the tile locality for each processor at the beginning, since each processor initially works on tiles that are close to each other. How large this effect is depends on how the overall workload is distributed across the initial regions.
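The tile selection of this region-based scheme may be sketched as follows. The per-processor queues, the function name `next_tile` and the choice to take the frontmost remaining tile of another processor are assumptions of this sketch; a concurrent implementation would additionally have to protect the queues against simultaneous access.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def next_tile(proc: int, queues: List[Deque]) -> Optional[Tuple[int, object]]:
    """Return (owner, tile) for the next tile that processor `proc` should render.

    The processor first drains its own region. Afterwards it visits the other
    processors in scanline order, starting with its successor, and takes a
    tile from the first one that still has open work.
    """
    n = len(queues)
    for offset in range(n):            # offset 0 is the processor's own region
        owner = (proc + offset) % n
        if queues[owner]:
            return owner, queues[owner].popleft()
    return None                        # all regions are finished
```

In this sketch a processor keeps working on neighboring tiles of its own region for as long as that region has open tiles, which is the source of the improved locality at the beginning.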
Another dynamic algorithm for tile decomposition was proposed by Whitman (S. Whitman, 1994). Here, the image is decomposed into rectangular regions, where the number of regions is at least as large as the number of processors. Initially, each processor is assigned to one such region, and when this region is finished, the processor requests the next free region. Inside a region, the internal workload is handled along scanlines. When no free region is left for a processor, the workload of the processor that has the most work items open is split into two workloads. Since the open workloads form a rectangular region, the split decomposes that rectangular region into two smaller rectangular regions; the free processor handles the lower one, while the other processor continues its work on the upper part. Because a processor always starts working on a region at the upper part, it is guaranteed that the current processor can continue on its workload even during a split, until it finishes its assigned region. A lock prevents more than one processor at a time from trying to split the same region.
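The split step described above may be sketched as follows, modeling the open workload of each processor only by its range of unprocessed scanlines. The dictionary layout, the function name `split_largest` and the choice of the split row are assumptions of this sketch; in a concurrent setting, the call would have to be made while holding the lock mentioned above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical state: processor id -> (next_row, end_row), a half-open range
# of scanlines that the processor has not rendered yet, handled top to bottom.
OpenRegions = Dict[int, Tuple[int, int]]


def split_largest(open_regions: OpenRegions) -> Optional[Tuple[int, Tuple[int, int]]]:
    """Split the largest open workload and return (victim, lower_part).

    The victim keeps the upper part and can continue undisturbed, since it
    works from the top; the free processor receives the lower part.
    """
    if not open_regions:
        return None
    victim = max(open_regions, key=lambda p: open_regions[p][1] - open_regions[p][0])
    top, bottom = open_regions[victim]
    if bottom - top < 2:
        return None                      # nothing left that could be shared
    mid = (top + bottom + 1) // 2        # the upper part keeps the extra row
    open_regions[victim] = (top, mid)
    return victim, (mid, bottom)
```

For example, with open ranges `{0: (10, 20), 1: (0, 4)}`, processor 0 has the most open scanlines; it keeps rows 10 to 14, and the free processor receives rows 15 to 19.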
Adaptive decomposition algorithms use additional assumptions when creating the image decomposition in order to take a varying complexity distribution across the image into account. Such additional data is usually created by estimating the expected workload that has to be handled, with the goal of decomposing the image so that all created regions have the same amount of expected workload. Often, such approaches use space partitioning algorithms to calculate a solution.
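One possible space partitioning approach is a recursive bisection of an estimated cost map, sketched below. The cost map `cost[y][x]`, the function name `split_balanced` and the rule of cutting along the longer axis are assumptions of this sketch and do not reproduce any particular cited algorithm.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open


def split_balanced(cost: Sequence[Sequence[int]], x0: int, y0: int,
                   x1: int, y1: int, depth: int) -> List[Region]:
    """Recursively cut a region so that both halves have similar expected cost."""
    if depth == 0 or (x1 - x0 <= 1 and y1 - y0 <= 1):
        return [(x0, y0, x1, y1)]
    vertical_cut = (x1 - x0) >= (y1 - y0)   # cut across the longer axis
    if vertical_cut:
        lines = [sum(cost[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
    else:
        lines = [sum(cost[y][x] for x in range(x0, x1)) for y in range(y0, y1)]
    total = sum(lines)
    # Find the cut position whose prefix cost is closest to half of the total.
    best_k, best_diff, prefix = 1, None, 0
    for k in range(1, len(lines)):
        prefix += lines[k - 1]
        diff = abs(2 * prefix - total)
        if best_diff is None or diff < best_diff:
            best_k, best_diff = k, diff
    if vertical_cut:
        cut = x0 + best_k
        return (split_balanced(cost, x0, y0, cut, y1, depth - 1)
                + split_balanced(cost, cut, y0, x1, y1, depth - 1))
    cut = y0 + best_k
    return (split_balanced(cost, x0, y0, x1, cut, depth - 1)
            + split_balanced(cost, x0, cut, x1, y1, depth - 1))


def region_cost(cost: Sequence[Sequence[int]], region: Region) -> int:
    """Sum the expected cost inside a region."""
    x0, y0, x1, y1 = region
    return sum(cost[y][x] for y in range(y0, y1) for x in range(x0, x1))
```

For a 6 x 2 cost map whose two rightmost columns are twice as expensive as the others, one level of bisection places the cut after the fourth column, so that both regions carry the same expected workload although they differ in size.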
Other additional information that can be used is the workload of the previously calculated image, assuming that, during an interaction, the workload required in a region does not change substantially between two consecutive image calculations.
Adaptive algorithms are combined with static or dynamic algorithms to distribute the workload. When combined with dynamic algorithms, the assumption information can also be used to influence the work distribution within the dynamic scheduling procedure, e.g. by sorting the tiles according to the expected workload and handling the tiles with higher expected workload first.
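The effect of handling expensive tiles first may be illustrated with a small simulation of dynamic scheduling. The function names and the cost values are hypothetical, and the simulation assumes that the expected cost of a tile equals its actual cost.

```python
import heapq
from typing import Dict, Hashable, List, Sequence


def order_by_expected_cost(tiles: Sequence[Hashable],
                           expected_cost: Dict[Hashable, int]) -> List[Hashable]:
    """Sort tiles so that those with the highest expected workload come first."""
    return sorted(tiles, key=lambda t: expected_cost[t], reverse=True)


def makespan(order: Sequence[Hashable], cost: Dict[Hashable, int], num_threads: int) -> int:
    """Simulate dynamic scheduling: the thread that becomes free first takes the next tile."""
    finish_times = [0] * num_threads
    heapq.heapify(finish_times)
    for tile in order:
        free_at = heapq.heappop(finish_times)          # earliest available thread
        heapq.heappush(finish_times, free_at + cost[tile])
    return max(finish_times)
```

With tile costs `{"a": 1, "b": 1, "c": 1, "d": 1, "e": 8}` and two threads, handling the tiles in the order a to e finishes after 10 time units, because the expensive tile is started last. Handling tile e first finishes after 8 time units.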
When doing volume rendering on a volume data format that stores the volume as a linear slice array, the caching behavior of the volume renderer is strongly influenced by the orientation of the volume relative to the viewing camera. In US 2011/0170756, the entire contents of which are hereby incorporated herein by reference, an algorithm was presented that overcomes this issue when calculating the volume samples on a regular grid, a condition that is satisfied e.g. in the case of volume rendering using an orthographic projection. For a 3-dimensional block within a 3-dimensional regular grid, it was shown how to find a permutation of the sampling positions that optimizes cache reuse when the volume voxels are sampled in the sequence defined by that permutation.
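The dependence on orientation may be illustrated by the address arithmetic of a linear slice array. The sketch assumes an x-fastest memory layout and hypothetical function names; it only illustrates the problem and does not reproduce the permutation algorithm of US 2011/0170756.

```python
from typing import Tuple


def linear_index(x: int, y: int, z: int, nx: int, ny: int) -> int:
    """Index of voxel (x, y, z) in a linear slice array with x varying fastest."""
    return x + nx * (y + ny * z)


def step_stride(direction: Tuple[int, int, int], nx: int, ny: int) -> int:
    """Distance in array elements between two successive samples along `direction`."""
    dx, dy, dz = direction
    return abs(dx + nx * (dy + ny * dz))
```

For slices of 512 x 512 voxels, stepping along x moves by one element per sample, so consecutive samples typically share a cache line. Stepping along z moves by 262144 elements per sample, so each sample touches a different slice and typically a different cache line.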
Further details on the above mentioned approaches, the entire contents of each of which is hereby incorporated herein by reference, may be found e.g. in:
    F. Abraham, W. Celes, R. Cerqueira and J. Campos, “A load-balancing strategy for sort-first distributed rendering”, in 17th Brazilian Symposium on Computer Graphics and Image Processing, 2004, pp. 292-299.
    H. Kutluca, T. M. Kurc and C. Aykanat, “Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids”, The Journal of Supercomputing, vol. 15, pp. 51-93, 1997.
    B. B. Labronici, C. Bentes, L. Maria, A. Drummond and R. Farias, “Dynamic screen division for load balancing the raycasting of irregular data”, Cluster Computing and Workshops, IEEE CLUSTER 2009, pp. 1-10, 2009.
    K-L. Ma, J. S. Painter, C. D. Hansen and M. F. Krogh, “A data distributed, parallel algorithm for ray-traced volume rendering”, Parallel Rendering Symposium, pp. 15-22, 1993.
    S. Marchesin, C. Mongenet and J-M. Dischler, “Dynamic load balancing for parallel volume rendering”, Eurographics Symposium on Parallel Graphics and Visualization 2006, pp. 43-50.
    J. Nieh and M. Levoy, “Volume rendering on scalable shared-memory MIMD architectures”, in Proc. of the 1992 Workshop on Volume Visualization, October 1992, pp. 17-24.
    M. E. Palmer, B. Totty and S. Taylor, “Ray Casting on Shared-Memory Architectures. Memory-Hierarchy Considerations in Volume Rendering”, IEEE Concurrency, IEEE Service Center, Piscataway, NJ, US, vol. 6, no. 1, January 1998, pp. 20-35, XP000737883.
    S. Parker, P. Shirley, Y. Livnat, C. Hansen and P.-P. Sloan, “Interactive ray tracing for isosurface rendering”, VIS '98: Proceedings of the conference on Visualization '98, pp. 233-238.
    R. Samanta, J. Zheng, T. Funkhouser, K. Li and J. P. Singh, “Load Balancing for multi-projector rendering systems”, in Proc. of the SIGGRAPH/Eurographics Workshop on Graphics Hardware, August 1999, pp. 107-116.
    R. Schneider, “Method for sampling volume data of an object in an imaging device”, United States Patent Application No. 20110170756, mentioned above.
    J. P. Singh, A. Gupta and M. Levoy, “Parallel Visualization Algorithms: Performance and Architectural Implications”, Computer, pp. 45-55, 1994.
    S. Tabik, F. Romero, G. Utrera and O. Plata, “Mapping parallel loops on multicore systems”, in 15th Workshop on Compilers for Parallel Computing (CPC 2010), Vienna, July 2010.
    S. Whitman, “Dynamic load balancing for parallel polygon rendering”, IEEE Computer Graph. and Appl., vol. 14, no. 4, pp. 41-48, July 1994.