The invention relates to volume rendering, in particular how to provide acceleration of the volume rendering when using a computer system that includes a graphics processing unit (GPU).
Volume rendering is a standard method of displaying 2D projections of the 3D data sets collected by medical imaging equipment, such as computer-assisted tomography (CT) scanners, magnetic resonance (MR) scanners, ultrasound scanners and positron-emission-tomography (PET) systems. In the early days of medical imaging, volume rendering was performed on vendor-specific software and hardware associated with the scanner. However, for a number of years, application software that implements volume rendering on standard personal computers (PCs) and workstations, without any bespoke hardware, has been well known.
It is also the case that modern personal computers and workstations include a graphics card, and in most cases the graphics card includes a Graphics Processing Unit (GPU). Typically a GPU consists of the following units:
- A geometry processor, commonly called a Vertex Shader (VS). Its function is to perform coordinate transformations on polygons and other primitives.
- A rasterization unit, whose purpose is to convert polygons that emerge from the VS into pixel clusters for shading.
- A pixel processor, commonly called a Pixel Shader (PS). Its function is to compute the shading, texture, and other visual properties of pixels.
- Other circuits such as a frame buffer, Z-buffer, hierarchical Z-buffer, stencil buffer, RAMDAC, etc. These are not of relevance to the invention.
- A memory hierarchy, typically comprising: registers in the pixel shader; on-chip cache; off-chip on-board DRAM; and access to the memory of the host system via a bus.
In terms of aggregate processing power, modern GPUs outperform CPUs by roughly an order of magnitude. They achieve this by parallel processing using a Single Instruction Multiple Data (SIMD) architecture. The SIMD architecture allows a large number of processing elements to operate on the GPU chip simultaneously, but introduces dependencies between the processing elements.
There are commonly two types of SIMD dependency in the PS unit of the GPU:
- A number of pixel processing elements (PEs) share a single control unit. Although logically each PE can execute a separate control path through the program, including conditionals and loops, the shared control unit has to decode and emit the aggregate (set union) of all the instructions required for all control paths taken by all the dependent PEs. For example, if one PE iterates a loop 15 times and then takes branch A of a conditional, while another PE iterates 10 times and then takes branch B, the control unit has to emit instructions for 15 loop iterations, branch A, and branch B. The first PE will be idle during the processing of branch B. The second PE will be idle during the last 5 loop iterations and during branch A.
- A number of PEs are arranged so that they can process pixels that are geometrically adjacent to each other as a two-dimensional (2D) tile of the destination image buffer. For example, 24 PEs may be arranged to process a 6×4 tile of the destination image buffer. If a polygon obliquely intersects the 6×4 pixel tile so that it covers only three pixels, the cluster of 24 PEs will still have to execute the shader for those three pixels. If there are no adjacent polygons covering the other 21 pixels, 21 of the PEs will stay idle for the duration of the pixel shader program.
FIG. 1 shows this schematically for the case of a GPU in which a 6×4 array or “tile” of PEs share one control unit. Some GPUs only have one control unit and PE tile, whereas other GPUs have multiple control units and PE tiles.
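The first of these dependencies can be captured in a simple cost model. The following Python sketch (not part of the invention; the function name and branch instruction counts are invented for illustration) reproduces the two-PE example above, in which the shared control unit must emit the longest loop plus the set union of all branches taken:

```python
def simd_tile_cost(pe_paths, loop_cost=1, branch_costs=None):
    """Cycles the shared control unit spends on one tile of PEs.

    pe_paths: one (loop_iterations, branch_label) pair per PE.
    The control unit must emit the longest loop plus every distinct
    branch taken by any PE in the tile (the set union of paths).
    """
    if branch_costs is None:
        branch_costs = {"A": 4, "B": 6}  # assumed instruction counts
    max_iters = max(iters for iters, _ in pe_paths)
    branches_taken = {branch for _, branch in pe_paths}
    return max_iters * loop_cost + sum(branch_costs[b] for b in branches_taken)

# The two-PE example from the text: 15 iterations then branch A,
# versus 10 iterations then branch B.
cost = simd_tile_cost([(15, "A"), (10, "B")])
# Both PEs occupy the control unit for 15 + 4 + 6 = 25 cycles, although
# they individually need only 19 and 16 cycles of useful work.
```

The model makes the idle time explicit: each PE pays for every control path taken anywhere in its tile, not just its own.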
When a graphics card with a GPU is being used for its intended purposes, the limitations imposed by the SIMD dependencies are acceptable. This is because the intended applications, such as rendering large polygons or large meshes of small polygons, cover a large area of the destination image buffer. Given adequate performance of the rasterization unit, reasonable locality of the polygons in the mesh, and an adequately large buffer for assigning rasterized polygon fragments to PEs, all PEs can be well utilized.
FIG. 2 illustrates how part of a large polygon mesh might map to an arrangement of 6×4 PEs. Each dot represents a pixel centre of the destination image buffer, rendered by one PE. Clearly, large polygon meshes can achieve good PE utilization.
A further reason why the SIMD limitations are acceptable for typical polygon rendering applications is that the pixel shader actions are the same or very similar for all pixels covered by a polygon mesh. Thus, in the example above, it is expected that all 24 PEs will be executing the same instructions for a large fraction of their shader programs, and that the proportion of PE idle time will be low.
The present invention is based on the premise that it would be desirable to harness the processing power available in a GPU to accelerate the volume rendering process. This is not a new idea.
Although not originally designed with this use in mind, GPUs have sufficiently general programmability that they can be applied to the task of volume rendering, in particular to volume rendering in medicine, where the task is usually to render images of the internal organs of human patients. However, when applied to volume rendering in medicine, the SIMD limitations of GPUs discussed above tend to have a strongly detrimental effect on performance, for the following reasons:
- Although CT, MR, and PET scanners scan a large section of the body, the typical display requirement is to show only certain organs, such as blood vessels, kidneys, the skeleton, etc. These organs occupy a small fraction of the space inside the patient. Thus, volume rendering for medical applications is best suited to selective, sparse processing of the volume, whereas the SIMD architecture of GPUs is best suited to uniform processing of the whole volume.
- Volume rendering algorithms require the processing of a large number of samples per pixel, roughly proportional to the depth of the ray that is cast through the pixel to sample the volume. Given the sparse nature of the volume, some pixels will require many depth samples to be processed, while others will require few or none. The SIMD limitations mean that the overall processing time of a tile of pixels, such as the 6×4 tile described in the example, is the time needed to process the longest ray in the tile.
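The second limitation can likewise be quantified with a small sketch. The following Python fragment (illustrative only; the function names, tile contents, and per-sample cost are invented) contrasts the SIMD time for a sparse tile, set by its deepest ray, with the lower bound that perfect load balancing across the PEs would achieve:

```python
import math

def tile_time(ray_depths, cost_per_sample=1):
    """SIMD time for one tile: every PE waits for the deepest ray."""
    return max(ray_depths) * cost_per_sample

def ideal_time(ray_depths, n_pe, cost_per_sample=1):
    """Lower bound if samples could be balanced evenly across the PEs."""
    return math.ceil(sum(ray_depths) / n_pe) * cost_per_sample

# A sparse 6x4 tile: three rays hit tissue at different depths,
# the remaining 21 rays pass through empty space.
depths = [120, 40, 8] + [0] * 21
print(tile_time(depths))            # 120 samples' worth of time under SIMD
print(ideal_time(depths, n_pe=24))  # 7 with perfect load balancing
```

For such a sparse tile the SIMD constraint costs more than an order of magnitude relative to the balanced ideal, which is the gap the invention seeks to close.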
FIG. 3 shows a typical volume-rendered image of a patient's kidney and associated vessels. In this figure, the rendering parameters have been set to display only the kidneys, vessels, and skeleton, making all the other material transparent. Clearly, certain pixels (A, B, C) display only transparent space, while others (D to J) display tissue. Of the pixels that display tissue, the depth of tissue that must be sampled for each pixel varies considerably. Rays cast through pixels D and E hit a relatively thin section of tissue. Rays cast through F and G hit a thick section of tissue, which must all be processed because the rendering parameters define it as partially transparent. Rays H and J hit a thick section of vessel or kidney but, because vessel and kidney are displayed as opaque, only the surface samples need to be processed.
Thus the ranking of rays by decreasing depth of tissue that needs to be processed is roughly as follows: F, G, D, E, H, J, A, B, C. This image has been created for illustration and the example pixels are much further apart than the 6×4 or similar tile that the GPU is constrained to render in SIMD mode. However, a similar variability of the depth of rays that need to be processed occurs at the 6×4 or similar tile scale, and thus the SIMD limitations of the GPU degrade the performance of the volume rendering application. An example of highly local variability of the ray depth would be a projection of a blood vessel.
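The contrast between rays F, G and rays H, J arises from early ray termination during front-to-back compositing. The following Python sketch (not part of the patent; the opacity values and cutoff are invented for illustration) shows why a ray through opaque material needs only a few surface samples, while a ray through semi-transparent material consumes every sample along its depth:

```python
def composite_ray(sample_colors, sample_alphas, opacity_cutoff=0.95):
    """Accumulate samples front to back; stop once the ray is nearly opaque.

    Returns (color, accumulated alpha, number of samples actually processed).
    """
    color, alpha = 0.0, 0.0
    processed = 0
    for c, a in zip(sample_colors, sample_alphas):
        color += (1.0 - alpha) * a * c
        alpha += (1.0 - alpha) * a
        processed += 1
        if alpha >= opacity_cutoff:
            break  # remaining samples are occluded and can be skipped
    return color, alpha, processed

# Opaque tissue (like rays H, J): the first couple of samples saturate the ray.
_, _, n_opaque = composite_ray([1.0] * 100, [0.8] * 100)
# Semi-transparent tissue (like rays F, G): all 100 samples contribute.
_, _, n_translucent = composite_ray([1.0] * 100, [0.005] * 100)
```

Under SIMD execution, however, a PE that terminates its ray after two samples must still wait for a tile-mate that processes all one hundred, which is exactly the per-tile variability described above.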
The object of the present invention is to circumvent the SIMD limitations of GPUs and achieve more efficient rendering of sparse volume data.