Ray tracing is a technique for generating images by simulating the behavior of light within a three-dimensional scene, typically by tracing light rays from the camera into the scene, as depicted in FIG. 1. In general, two types of rays are used. The ray that comes from the screen or the viewer's eye (i.e., the point of view) is called the primary ray. Tracing and processing the primary ray is called primary ray shooting, or just ray shooting. If the primary ray hits an object, the light may bounce from the surface of the object at the primary point of intersection. The rays spawned in this way are called secondary rays. Primary rays are traced from a particular point on the camera image plane (a pixel) into the scene until they hit a surface, at a so-called hit point or primary intersection point. Shadow rays and secondary rays are traced from a hit point to determine how it is lit. The origin of a shadow ray is on the surface of an object, and the ray is directed towards a light source. If the ray hits any object before it reaches the light source, the point located at the ray origin is in shadow and should be assigned a dark color. Processing the shadow ray is called shadowing. Finally, to determine how the surface material appears, texture lookups and shading computations are performed at or near the hit point. FIG. 2 shows a scene having three objects and a single light source. Three ray generations are created when the primary ray spawns other rays (N′ surface normal, R′ reflected ray, L′ shadow ray, T′ transmitted (refracted) ray).
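By way of illustration only, the primary ray shooting and shadowing steps described above may be sketched as follows. The sketch assumes a scene made solely of spheres and a single point light source; the function names (ray_sphere_t, nearest_hit, in_shadow) and the small offset t_min are hypothetical and are not taken from the figures.

```python
import math

def ray_sphere_t(origin, direction, center, radius, t_min=1e-4):
    """Nearest hit distance t > t_min along a unit-length direction, or None.

    t_min is an assumed small offset that keeps a ray starting on a
    surface from re-hitting that same surface at t = 0.
    """
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None  # the ray misses the sphere
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > t_min:
            return t
    return None

def nearest_hit(origin, direction, spheres):
    """Primary ray shooting: smallest hit distance over all spheres, or None."""
    best = None
    for center, radius in spheres:
        t = ray_sphere_t(origin, direction, center, radius)
        if t is not None and (best is None or t < best):
            best = t
    return best

def in_shadow(point, light, spheres):
    """Shadowing: True if any object lies between the hit point and the light."""
    to_light = tuple(l - p for l, p in zip(light, point))
    dist = math.sqrt(sum(x * x for x in to_light))
    direction = tuple(x / dist for x in to_light)
    t = nearest_hit(point, direction, spheres)
    return t is not None and t < dist

# Hypothetical scene: one visible sphere and one blocker near the light.
visible = ((0.0, 0.0, -5.0), 1.0)
blocker = ((0.0, 0.0, 2.0), 0.5)
light = (0.0, 0.0, 5.0)

t_primary = nearest_hit((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), [visible, blocker])
hit_point = (0.0, 0.0, -t_primary)
```

In this arrangement the primary ray hits the visible sphere, and the shadow ray from the hit point towards the light is blocked only when the blocker sphere is present in the scene.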
Ray tracing is a computationally expensive algorithm. Fortunately, ray tracing is quite easy to parallelize, because the contribution of each ray to the final image can be computed independently of the other rays. For this reason, considerable effort has been put into finding the best parallel decomposition for ray tracing. There are two main approaches in the prior art to parallel ray tracing: (i) ray-parallel, in which rays are distributed among parallel processors and each processor traces a ray all the way, and (ii) data-parallel, in which the scene is distributed among multiple processors and a ray is handled by multiple processors in succession.
A ray-parallel implementation of ray tracing simply replicates all the data on each processor and subdivides the screen into a number of disjoint regions. Each processor then renders a number of regions using the unaltered sequential version of the ray tracing algorithm, until the whole image is completed. Whenever a processor finishes a region, it asks the master processor for a new task. This is also called the demand-driven approach, or image space subdivision. Load balancing is achieved dynamically by sending new tasks to processors that have just become idle. However, if very large models need to be rendered, the scene data has to be distributed over the memories, because the local memory of each processor is not large enough to hold the entire scene. The demand-driven approach then suffers from massive copying and replication of geometric data.
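The demand-driven scheme described above may be sketched as a small simulation. This is a minimal sketch under stated assumptions: the per-region rendering costs are given as plain numbers, the master hands out regions in order, and the names split_into_tiles and demand_driven_schedule are hypothetical.

```python
import heapq

def split_into_tiles(width, height, tile):
    """Subdivide the screen into disjoint regions (x, y, w, h)."""
    tiles = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            tiles.append((x, y, min(tile, width - x), min(tile, height - y)))
    return tiles

def demand_driven_schedule(costs, n_workers):
    """Give each next task to whichever processor becomes idle first.

    costs[i] is the assumed rendering cost of region i. Returns the list of
    task ids handled by each processor and each processor's finish time.
    """
    # Heap entries are (time at which the processor becomes idle, processor id).
    idle_at = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(idle_at)
    assignment = [[] for _ in range(n_workers)]
    for task_id, cost in enumerate(costs):
        t, w = heapq.heappop(idle_at)      # processor that has just become idle
        assignment[w].append(task_id)
        heapq.heappush(idle_at, (t + cost, w))
    finish = [0.0] * n_workers
    for t, w in idle_at:
        finish[w] = t
    return assignment, finish

# One expensive region followed by five cheap ones, on two processors.
assignment, finish = demand_driven_schedule([5, 1, 1, 1, 1, 1], 2)
```

In this example one processor renders the single expensive region while the other works through the five cheap ones, so both finish at the same time. The sketch does not model the data replication cost noted above.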
Data-parallel rendering is a different approach, suited to scenes that do not fit into a single processor's memory. Here, the object data is distributed over the processors. Each processor owns only a subset of the database, and it traces rays only when they pass through its own subspace. The high data locality of this approach avoids massive movement of data, answering the needs of very large models. However, the rendering cost per ray and the number of rays passing through each subset of the database are likely to vary (e.g., hot spots are caused by viewpoints and light sources), leading to severe load imbalances, a problem that is difficult to solve with either static or dynamic load balancing schemes. Efficiency thus tends to be low in such systems.
In order to exploit locality between data accesses as much as possible, some spatial subdivision is usually used to decide which parts of the scene are stored on which processor. In its simplest form, the data is distributed according to a uniform subdivision of space. Each processor holds one or more equal-sized voxels. Having just one voxel per processor allows the data decomposition to be mapped directly onto a 3D grid topology. However, since the number of objects may vary dramatically from voxel to voxel, the cost of tracing a ray through each of these voxels will vary, and therefore this approach may lead to severe load imbalances.
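The uniform distribution described above, and the way object counts can differ between voxels, may be illustrated with the following sketch. It assumes that each object is represented by a single point (for example its centroid), which ignores objects that straddle voxel boundaries; the names voxel_index and objects_per_voxel are hypothetical.

```python
from collections import Counter

def voxel_index(point, scene_min, scene_max, res):
    """Map a 3D point to the (i, j, k) index of its equal-sized voxel."""
    idx = []
    for p, lo, hi in zip(point, scene_min, scene_max):
        i = int((p - lo) / (hi - lo) * res)
        idx.append(min(max(i, 0), res - 1))  # clamp points on the outer boundary
    return tuple(idx)

def objects_per_voxel(points, scene_min, scene_max, res):
    """Count how many object points fall into each voxel of a res^3 grid."""
    return Counter(voxel_index(p, scene_min, scene_max, res) for p in points)

# Hypothetical scene: three objects clustered in one corner, one elsewhere.
points = [(0.5, 0.5, 0.5), (1.0, 1.0, 1.0), (1.5, 0.2, 0.3), (3.0, 3.0, 3.0)]
counts = objects_per_voxel(points, (0.0, 0.0, 0.0), (4.0, 4.0, 4.0), 2)
```

With a 2 x 2 x 2 grid, one voxel receives three objects, one voxel receives a single object, and the remaining six are empty, so processors holding one voxel each would face very different workloads.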
The way the data is distributed over processors has a strong impact on how well the system performs. The more even the workload associated with a particular data distribution, the less idle time is to be expected. Three main criteria need to be observed for such distributions to lead to efficient execution of the parallel algorithm (Salmon and Goldsmith): (i) The memory overhead for each processor should be as equal as possible. (ii) Communication requirements during rendering need to be minimized. (iii) Processing time for each processor needs to be equalized.
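The relation between workload evenness and idle time can be made concrete with a simple measure. The following is a sketch only, assuming that all processors wait for the slowest one before the frame is complete; the function name idle_fraction is hypothetical, and the measure addresses only criterion (iii), not memory overhead or communication.

```python
def idle_fraction(busy_times):
    """Fraction of total processor time spent idle while waiting for the slowest.

    busy_times[i] is the assumed processing time of processor i for one frame.
    """
    longest = max(busy_times)
    if longest == 0:
        return 0.0
    return 1.0 - sum(busy_times) / (len(busy_times) * longest)

even = idle_fraction([4, 4, 4, 4])
uneven = idle_fraction([8, 2, 2, 4])
```

A perfectly even distribution yields no idle time, whereas in the uneven example half of the available processor time is wasted.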
Generating data distributions which adhere to all three criteria is a difficult problem, which remains unsolved in prior art. Most data distributions are limited to equalizing the memory overhead for each processor. This is a relatively simple exercise, because generating an adaptive spatial subdivision, such as an octree or KD-tree, gives sufficient clues as to which regions of space contain how many objects.
Another problem in ray tracing is the high processing cost of acceleration structures. For each frame, a rendering system must find the intersection points between many rays and many polygons. The cost of testing each ray against each polygon is prohibitive, so such systems typically use acceleration structures (such as octrees, KD-trees, other binary trees, bounding boxes, etc.) to reduce the number of ray/polygon intersection tests that must be performed. Because the data is sorted over space by the acceleration structure, the data distribution over the processors is based on this structure as well. The spatial subdivision is also used to establish which data needs to be fetched from other processors. Moreover, construction of optimized structures is expensive and does not allow the acceleration structure to be rebuilt every frame to support interactive ray tracing of large dynamic scenes. The construction times for larger scenes are very high and do not allow dynamic changes.
There has been an attempt in the prior art to lower the cost and complexity of acceleration structures by using their simplest form, in which the data is distributed uniformly. Each processor holds one or more equal-sized voxels. Having just one voxel per processor allows the data decomposition to be mapped directly onto a 3D grid topology. However, since the number of objects may vary dramatically from voxel to voxel, the cost of tracing a ray through each of these voxels will vary. This approach therefore leads to severe load imbalances, and consequently the uniform distribution has been abandoned.
Today, the most popular data structure in the prior art is the KD-tree. Ray traversal in a KD-tree is particularly efficient for scenes with a non-uniform distribution of objects. However, the massive traversal of an acceleration structure based on a KD-tree typically consumes a major portion of the frame time. The ray-object intersection tests of the prior art are considered the heaviest part of ray tracing, due to extensive traversal of the acceleration data structures and massive memory access. Thus, there is a great need in the art for a method that provides improved load balancing, reduced traversal relying on a simple data structure, and a reduced number of intersection tests.