Ray-tracing is a technique for generating images by simulating the behavior of light within a three-dimensional scene by typically tracing light rays from the camera into the scene, as depicted in FIG. 1A. In general two types of rays are used. Primary rays are traced from a particular point on the camera image plane (a pixel) into the scene, until they hit a surface, at a so-called hit or intersection point. Shadow and secondary rays are traced from a hit point to determine how it is lit. Finally, to determine how the surface material appears texture lookups and shading computations are performed at or near the hit point. FIG. 1B shows a scene having three objects and single light source. Three ray generations are created when the primary ray spawns other rays (N′ surface normal, R′ reflected ray, L′ shadow ray, T′ transmitted (refracted) ray).
Ray tracing is a high computationally expensive algorithm. Fortunately, ray tracing is quite easy to parallelize. The contribution of each ray to the final image can be computed independently from the other rays. For this reason, there has been a lot of effort put into finding the best parallel decomposition for ray tracing.
There are two main approaches to parallel ray-tracing in prior art: (i) ray-parallel, in which rays are distributed among parallel processors, while each processor traces a ray all the way, and (ii) data-parallel, in which the scene is distributed among multiple processors, while a ray is handled by multiple processors in a row.
The ray-parallel implementation of ray tracing would simply replicate all the data with each processor and subdivide the screen into a number of disjunct regions. Each processor then renders a number of regions using the unaltered sequential version of the ray tracing algorithm, until the whole image is completed. Whenever a processor finishes a region, it asks the master processor for a new task. This is also called the demand driven approach, or an image space subdivision. Load balancing is achieved dynamically by sending new tasks to processors that have just become idle. However, if very large models need to be rendered, the scene data have to be distributed over the memories, because the local memory of each processor is not large enough to hold the entire scene. Then demand driven approach suffers from massive copies and multiplications of geometric data.
Data-parallel is a different approach to rendering scenes that do not fit into a single processor's memory. Here, the object data is distributed over the processors. Each processor owns only a subset of the database and it traces rays only when they pass through its own subspace. Its high data locality excludes massive moves of data, answering the needs of very large models. However, rendering cost per ray and the number of rays passing through each subset of database are likely to vary (e.g. hot spots are caused by viewpoints and light sources), leading to severe load imbalances, a problem which is difficult to solve either with static or dynamic load balancing schemes. Efficiency thus tends to be low in such systems.
In order to exploit locality between data accesses as much as possible, usually some spatial subdivision is used to decide which parts of the scene are stored with which processor. In its simplest form, the data is distributed according to a uniform distribution. Each processor will hold one or more equal sized voxels (volume elements). Having just one voxel per processor allows the data decomposition to be nicely mapped onto a 3D grid topology. However, since the number of objects may vary dramatically from voxel to voxel, the cost of tracing a ray through each of these voxels will vary and therefore this approach may lead to severe load imbalances.
A second, and more difficult problem to address, is the fact that the number of rays passing through each voxel is likely to vary. Certain parts of the scene attract more rays than other parts. This has mainly to do with the view point and the location of the light sources. Both the variations in cost per ray and the number of rays passing through each voxel indicate that having multiple voxels per processor is a good option, as it is likely to balance the workload, albeit at the cost of extra communication.
The way the data is distributed over processors has a strong impact on how well the system performs. The more even the workload associated with a particular data distribution, the less idle time is to be expected. Three main criteria need to be observed for such distributions to lead to efficient execution of the parallel algorithm (Salmon and Goldsmith):                The memory overhead for each processor should be as equal as possible        Communication requirements during rendering need to be minimized        Processing time for each processor needs to be equalized        
Generating data distributions which adhere to all three criteria is a difficult problem, which currently remains unsolved in prior art. Most data distributions are limited to equalizing the memory overhead for each processor. This is a relatively simple exercise, because generating an adaptive spatial subdivision, such as an octree or KD-tree, gives sufficient clues as to which regions of space contain how many objects.
Another problem in ray tracing is the high processing cost of acceleration structures. For each frame, a rendering system must find the intersection points between many rays and many polygons. The cost of testing each ray against each polygon is prohibitive, so such systems typically use accelerating structures (such as Uniform grid, Octree, KD-tree, other binary trees, bounding boxes, etc.) to reduce the number of ray/polygon intersection tests that must be performed. As the data is sorted over space with the acceleration structure, the data distribution over the processors is based on this structure as well. The spatial subdivision is also used to establish which data needs to be fetched from other processors. Moreover, construction of optimized structures is expensive and does not allow for rebuilding the accelerating structure every frame to support for interactive ray-tracing of large dynamic scenes. The construction times for larger scenes are very high and do not allow dynamic changes.
There has been an attempt in prior art to lower the cost and complexity of acceleration structures by using its simplest form, where the data is distributed uniformly. Each processor will hold one or more equal sized voxels. Having just one voxel per processor allows the data decomposition to be nicely mapped onto a 3D grid topology. However, since the number of objects may vary dramatically from voxel to voxel, the cost of tracing a ray through each of these voxels will vary and therefore this approach leads to severe load imbalances, and consequently the uniform distribution has been abandoned.
Today, the most popular data structure in prior art is the KD-tree. Ray traversal in a KD-tree is less efficient than in uniform grid. However, it is good for the scenes with non-uniform distribution of objects. The massive traversal of accelerating structure based on KD-tree typically consumes major chunk of the frame time (e.g. 63% in average, Venkatraman et al.). Thus, there is a great need in the art to devise a method of improved load balancing leaned on simple acceleration structure, such as uniform grid.