Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed-function computational units to process graphics data. More recently, however, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data. For example, “bounding volumes” (“BVs”, e.g., a “convex hull”) of a three-dimensional (3D) object may currently be computed by an application on a central processing unit (CPU) and provided as input to a number of rendering schemes.
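By way of illustration only, the CPU-side computation of a bounding volume mentioned above may be sketched as follows. The sketch assumes the simplest kind of bounding volume, an axis-aligned bounding box, rather than a convex hull, and the name `compute_aabb` and the vertex representation are hypothetical choices made for this example, not part of any particular application or rendering scheme.

```python
from typing import Iterable, Tuple

# Hypothetical vertex type for this sketch: an (x, y, z) triple.
Vec3 = Tuple[float, float, float]


def compute_aabb(vertices: Iterable[Vec3]) -> Tuple[Vec3, Vec3]:
    """Return (min_corner, max_corner) of the axis-aligned box enclosing the vertices.

    This is a sequential, host-side pass over the vertex data, i.e. the kind of
    work an application might perform on the CPU before handing the result to
    a rendering scheme.
    """
    it = iter(vertices)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("at least one vertex is required")

    lo = list(first)  # running per-axis minimum
    hi = list(first)  # running per-axis maximum
    for v in it:
        for axis in range(3):
            if v[axis] < lo[axis]:
                lo[axis] = v[axis]
            if v[axis] > hi[axis]:
                hi[axis] = v[axis]
    return tuple(lo), tuple(hi)


if __name__ == "__main__":
    corners = compute_aabb([(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (-1.0, 5.0, 0.5)])
    print(corners)
```

Because each vertex is visited in turn on the host, the graphics processor cannot consume the result until this pass has completed.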
To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In an SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. These solutions, however, rely on host processor synchronization, which may limit performance, for example by restricting the amount of parallel processing possible.