In recent years, general-purpose GPU computing has given rise to a number of methods for constructing bounding volume hierarchies (BVHs), octrees, and k-d trees for millions of primitives in real time. Some methods aim to maximize the quality of the resulting tree using the surface area heuristic, while others trade tree quality for increased construction speed.
The right quality vs. speed tradeoff depends heavily on the application. Tree quality is usually preferable in ray tracing, where the same acceleration structure is often reused for millions of rays. Broad-phase collision detection and particle interaction in real-time physics represent the other extreme, where construction speed is of primary importance: the acceleration structure has to be reconstructed on every time step, and the number of queries is usually fairly small. Furthermore, certain applications, such as voxel-based global illumination and surface reconstruction, specifically rely on regular octrees and k-d trees, where tree quality is fixed.
The main shortcoming of existing methods that aim to maximize construction speed is that they generate the node hierarchy sequentially, usually one level at a time, since each round of processing has to complete before the next one can begin. This limits the parallelism they can achieve at the top levels of the tree, and can lead to serious underutilization of the parallel cores. The sequential processing is already a bottleneck with small workloads on current GPUs, which require tens of thousands of independent parallel threads to fully utilize their computing power. The problem can be expected to become even more significant in the future as the number of parallel cores keeps increasing. Another implication of sequential processing is that the existing methods output the hierarchy in breadth-first order, even though a depth-first order would usually be preferable for data locality and cache hit rates.
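The underutilization argument can be made concrete with a rough count. The sketch below is illustrative only and rests on simplifying assumptions that are not part of any particular method: the tree is a balanced binary tree with one primitive per leaf, one thread processes one node, and a hypothetical threshold of 32768 threads stands in for "tens of thousands". The function names `level_widths` and `underutilized_levels` are invented for this example.

```python
def level_widths(num_primitives):
    """Upper bound on the node count at each level, root first.

    Assumes a balanced binary tree with one primitive per leaf
    (a simplification made for this illustration).
    """
    widths = []
    width = 1
    while width < num_primitives:
        widths.append(width)
        width *= 2
    # The last level cannot hold more nodes than there are primitives.
    widths.append(num_primitives)
    return widths


def underutilized_levels(num_primitives, threads_needed):
    """Count levels that offer fewer nodes than threads_needed.

    Assumes one thread per node, so a level with fewer nodes than
    threads_needed cannot keep the device busy on its own.
    """
    return sum(1 for w in level_widths(num_primitives) if w < threads_needed)


if __name__ == "__main__":
    n = 1_000_000      # number of primitives
    threads = 32768    # hypothetical thread count needed to fill the GPU
    total = len(level_widths(n))
    starved = underutilized_levels(n, threads)
    print(f"{starved} of {total} levels have fewer than {threads} nodes")
```

Under these assumptions, a level-by-level build over one million primitives spends most of its rounds on levels that are too narrow to occupy the device, and each of those rounds must still complete before the next can start. Raising the thread threshold, as happens when core counts grow, increases the number of starved levels.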