Processor cores of current graphics processing units (GPUs) are highly parallel multiprocessors that execute numerous threads concurrently. Furthermore, threads of such processors are often packed together into groups, called warps, which are executed in a single instruction multiple data (SIMD) fashion. At any one instant, all threads within a warp nominally apply the same instruction, each to its own private data values. If the processing unit is executing an instruction that some threads do not need to execute (e.g. due to a conditional statement, etc.), those threads are idle. This condition, known as divergence, should be carefully avoided because idle threads perform no useful work, thus reducing total computational throughput.
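The cost of divergence may be illustrated with a toy model. The following is a minimal sketch, not a description of any particular GPU: it assumes a hypothetical warp whose width equals the length of the input list, and it assumes that each side of an if/else costs one instruction slot and is issued once whenever at least one thread needs it. The function name `simd_utilization` is introduced here for illustration only.

```python
def simd_utilization(takes_branch):
    """Fraction of useful lane-slots when a warp runs an if/else in lockstep.

    takes_branch: one boolean per thread of a (hypothetical) warp.
    """
    width = len(takes_branch)
    n_if = sum(takes_branch)
    n_else = width - n_if
    # Each side of the branch is issued once if at least one thread needs it.
    passes = (1 if n_if else 0) + (1 if n_else else 0)
    # Every thread does useful work in exactly one of the issued passes.
    return width / (passes * width)


uniform = simd_utilization([True] * 8)            # all threads agree
divergent = simd_utilization([True, False] * 4)   # threads disagree
print(uniform, divergent)
```

Under these assumptions, a warp whose threads all take the same side of the branch keeps every lane busy, whereas a warp whose threads disagree issues both sides and leaves half of the lane-slots idle.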
Many applications, at some point, traverse a data structure (e.g. a search tree, etc.) in response to a query. In some cases, data records are stored only at the leaf nodes (e.g. nodes with no corresponding child nodes, etc.). The rest of the nodes in the search tree are called inner nodes. This restriction is common in B+ trees used in databases and file systems, as well as in the acceleration hierarchies used in ray tracing.
For example, given a ray in space, it is desirable to discover which (if any) object (e.g. a geometric primitive or a group of primitives) in a scene is first intersected by the ray. In some cases, these geometric primitives (e.g. points, lines, triangles, etc.) may be organized in a tree, such as a bounding volume hierarchy (BVH), a k-dimensional (kd) tree, or a binary space partitioning (BSP) tree. After the geometric primitives are organized in such a tree, ray tracing involves traversing the tree, searching for the leaf node or nodes that are intersected by a given ray. When such a leaf node or nodes are found, the ray may be intersected against the primitives contained by the node or nodes.
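The tree search described above may be sketched as follows. This is a simplified illustration under stated assumptions rather than a definitive implementation: nodes are plain dictionaries with hypothetical keys `lo`, `hi`, `children`, and `prims`; primitives are axis-aligned boxes standing in for triangles; and the helper names `ray_hits_box` and `closest_hit` are introduced here only for the example.

```python
import math


def ray_hits_box(origin, direction, lo, hi):
    """Slab test: entry distance t >= 0 along the ray, or None on a miss."""
    t_near, t_far = 0.0, math.inf
    for o, d, a, b in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < a or o > b:
                return None  # parallel to this slab and outside it
            continue
        t0, t1 = (a - o) / d, (b - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near


def closest_hit(root, origin, direction):
    """Return (distance, name) of the first primitive hit, or None."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        # Node traversal: skip subtrees whose bounds the ray misses.
        if ray_hits_box(origin, direction, node["lo"], node["hi"]) is None:
            continue
        if "children" in node:
            stack.extend(node["children"])
            continue
        # Primitive intersection: only leaf nodes hold primitives here.
        for name, lo, hi in node["prims"]:
            t = ray_hits_box(origin, direction, lo, hi)
            if t is not None and (best is None or t < best[0]):
                best = (t, name)
    return best


# A two-leaf hierarchy with one (box-shaped) primitive per leaf.
leaf_a = {"lo": (2, -1, -1), "hi": (3, 1, 1),
          "prims": [("near", (2, -1, -1), (3, 1, 1))]}
leaf_b = {"lo": (5, -1, -1), "hi": (6, 1, 1),
          "prims": [("far", (5, -1, -1), (6, 1, 1))]}
root = {"lo": (2, -1, -1), "hi": (6, 1, 1), "children": [leaf_a, leaf_b]}

hit = closest_hit(root, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
miss = closest_hit(root, (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(hit, miss)
```

The sketch visits every leaf whose bounds the ray enters and keeps the nearest hit. A practical traversal would typically also visit the nearer child first and skip nodes that lie beyond the closest hit found so far.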
A tree structure may also be organized so that inner nodes, as well as leaf nodes, may contain primitives. In this case, inner nodes differ from leaf nodes only in that they also contain other nodes. The nodes in this kind of tree structure may be processed in the same order as in a tree structure that may only contain primitives in leaf nodes. However, the ray may be intersected against primitives in both inner and leaf nodes.
Furthermore, the geometric primitives may be organized in a grid structure that may be traversed for determining ray-node intersections. In this case, each grid cell has a list of primitives that at least partially overlap the cell. The list may be empty if no primitive overlaps the cell. Traversal of a grid acceleration structure includes finding the cell that contains the ray origin and stepping from cell to adjacent cell along the ray. When a cell that contains primitives is encountered, the ray may be intersected against the primitives contained by the cell.
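The cell-to-cell stepping may be illustrated with a small two-dimensional sketch. It assumes unit-sized cells, a square grid of `size` by `size` cells, a ray origin inside the grid, and a dictionary mapping cell coordinates to primitive lists; the stepping scheme is the commonly used incremental (DDA-style) grid walk, and the names `cells_along_ray` and `first_occupied_cell` are hypothetical.

```python
import math


def cells_along_ray(origin, direction, size):
    """Yield the (ix, iy) unit cells of a size x size grid visited by a 2D ray."""
    ix, iy = math.floor(origin[0]), math.floor(origin[1])
    step = [1 if d > 0 else -1 for d in direction]
    # Ray distance to the next cell boundary per axis, and the per-cell increment.
    t_max, t_delta = [], []
    for o, d, i in zip(origin, direction, (ix, iy)):
        if d == 0:
            t_max.append(math.inf)
            t_delta.append(math.inf)
        else:
            boundary = i + 1 if d > 0 else i
            t_max.append((boundary - o) / d)
            t_delta.append(abs(1.0 / d))
    while 0 <= ix < size and 0 <= iy < size:
        yield ix, iy
        # Step across whichever cell boundary the ray reaches first.
        if t_max[0] < t_max[1]:
            ix += step[0]
            t_max[0] += t_delta[0]
        else:
            iy += step[1]
            t_max[1] += t_delta[1]


def first_occupied_cell(grid, origin, direction, size):
    """Return (cell, primitives) for the first non-empty cell, or None."""
    for cell in cells_along_ray(origin, direction, size):
        if grid.get(cell):  # a missing or empty list means nothing overlaps
            return cell, grid[cell]
    return None


visited = list(cells_along_ray((0.5, 0.5), (1.0, 0.5), 4))
grid = {(2, 1): ["tri7"], (3, 2): ["tri9"]}
found = first_occupied_cell(grid, (0.5, 0.5), (1.0, 0.5), 4)
print(visited, found)
```

The sketch stops at the first cell that contains primitives. A complete traversal would intersect the ray against those primitives and continue stepping if no intersection lies within the current cell.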
Regardless of the particular type of structure used for organizing the primitives, there are two basic operations that need to be executed during the tracing of a ray. The first operation is node traversal, which typically includes intersecting the ray against one, two, or another predetermined number of nodes and choosing the node to be considered next. By repeated application of the node traversal operation, the node or nodes that may contain primitives intersected by the ray may be found. The second operation is primitive intersection, i.e. intersecting the ray against the primitives in a node found during traversal. The execution of a ray tracing algorithm includes repeated application of these two operations in some order.
When formulating ray tracing algorithms on a highly parallel architecture such as a GPU, it is important to determine how rays and traversal tests are assigned to the various parallel threads of execution included in the parallel architecture. In particular, it is important to design a system to minimize divergence due to different threads in a warp making different decisions.
Various prior art techniques allow rays to traverse a data tree independently. As a result, each ray visits only the nodes it actually intersects, such that redundant work is avoided. At any given time in a SIMD architecture, however, the entire warp has to be executing either node traversal or primitive intersection. This causes execution type penalties. For example, if node traversal is chosen to be executed, the threads that currently require primitive intersection to be executed will have to remain idle.
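The execution type penalty may be quantified with a toy simulation. The sketch below is illustrative only: each thread is given a hypothetical, fixed sequence of pending operations ('T' for node traversal, 'I' for primitive intersection), and the warp is assumed to issue whichever operation the most unfinished threads want next, with ties resolved in favor of traversal. Neither the policy nor the name `run_warp` is taken from any particular prior art technique.

```python
def run_warp(work):
    """Simulate one warp; return (steps issued, idle lane-steps).

    work: one sequence per thread of pending operations, 'T' or 'I', in order.
    """
    queues = [list(ops) for ops in work]
    steps = idle = 0
    while any(queues):
        pending = [q[0] for q in queues if q]
        # Assumed policy: issue the operation most unfinished threads want next.
        op = max("TI", key=pending.count)
        for q in queues:
            if q and q[0] == op:
                q.pop(0)   # this thread makes progress
            elif q:
                idle += 1  # wants the other operation: execution type penalty
        steps += 1
    return steps, idle


mixed = run_warp(["TTI", "TII", "TTI", "ITI"])
uniform = run_warp(["TTI"] * 4)
print(mixed, uniform)
```

Threads that have already finished are deliberately not counted as idle here, so the second number isolates the execution type penalty. In this example each thread needs only three operations, yet the mixed warp needs five steps because its threads disagree on which operation to run.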
Another performance issue is the termination penalty, which arises when some threads in a packet of threads terminate before other threads of the packet. Here, a packet refers to one or more warps. For example, a packet may finish processing only when all of its rays have terminated. Thus, threads in the packet may remain idle waiting for other threads in the packet to complete processing. In some cases, this termination penalty may be significant.
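A back-of-the-envelope calculation shows how large the termination penalty may become. The sketch assumes that each ray in a packet needs a known number of steps and that the packet retires only after its slowest ray; the step counts and the name `termination_penalty` are hypothetical.

```python
def termination_penalty(steps_per_ray):
    """Idle lane-steps when a packet retires only after its slowest ray.

    Returns (idle lane-steps, idle fraction of all lane-steps).
    """
    longest = max(steps_per_ray)
    idle = sum(longest - s for s in steps_per_ray)
    return idle, idle / (longest * len(steps_per_ray))


# Three short rays share a packet with one long ray.
idle, fraction = termination_penalty([3, 4, 5, 20])
print(idle, fraction)
```

With these assumed step counts, a single long-running ray keeps the packet alive for 20 steps, and the three short rays spend most of that time waiting.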
There is thus a need for addressing these and/or other issues associated with the prior art.