Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, and depth testing. Traditionally, graphics processors used fixed-function computational units to process graphics data. More recently, however, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data.
Execution units (EUs) are typically implemented in graphics processors to execute one or more threads to perform shading operations (e.g., vertex and pixel shaders). However, selecting a number of physical registers per thread in the design of an EU presents a trade-off between die area and performance. For instance, having fewer registers can reduce the area overhead at the expense of an increased number of spill-fill operations, which can introduce large latencies and memory bottlenecks.
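The trade-off described above can be illustrated with a minimal sketch. It assumes a hypothetical per-instruction trace of live register counts and uses the number of live values exceeding the per-thread register budget as a rough proxy for spill pressure. The function name, the trace, and the budgets are illustrative assumptions only, and the proxy does not model any actual compiler's register allocator.

```python
def excess_live_values(live_counts, registers_per_thread):
    """Sum, over all program points, the live values that exceed the budget.

    This is only a rough proxy for spill-fill pressure (an assumption of
    this sketch), not a count of actual spill or fill instructions.
    """
    return sum(max(0, live - registers_per_thread) for live in live_counts)


# Hypothetical trace: 90 program points with 16 live registers,
# followed by a 10-point hotspot with 128 live registers.
trace = [16] * 90 + [128] * 10

for budget in (32, 64, 128):
    print(budget, excess_live_values(trace, budget))
```

Under these assumptions, a smaller budget leaves more live values without a physical register during the hotspot, while the largest budget eliminates the excess at the cost of a larger register file.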
Due to the significant performance impact of spill-fill operations, a large number of registers per thread is typically provided to avoid such performance pitfalls. If the worst-case register usage for a kernel is determined to be significantly smaller than the available register space, either the logical single instruction multiple data (SIMD) width or the number of threads can be increased for better utilization of the available registers. However, the number of live registers often varies dynamically during the execution of a program, with a few hotspots of large register usage. In such scenarios, allocating registers for the worst-case usage can result in poor utilization of die area.
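To make the utilization concern concrete, the following sketch computes how much of a worst-case register allocation is actually occupied on average. It assumes a hypothetical trace of live register counts, one entry per program point; the function name and the trace values are illustrative assumptions and are not taken from any particular kernel or hardware.

```python
def register_utilization(live_counts):
    """Return (peak, mean utilization) when allocating for the peak usage.

    Utilization is the average fraction of the allocated registers that
    hold live values across all program points in the trace.
    """
    if not live_counts:
        raise ValueError("live_counts must be non-empty")
    peak = max(live_counts)
    if peak == 0:
        return 0, 0.0
    utilization = sum(live_counts) / (peak * len(live_counts))
    return peak, utilization


# Hypothetical trace: 90 program points with 16 live registers,
# followed by a 10-point hotspot with 128 live registers.
trace = [16] * 90 + [128] * 10

peak, utilization = register_utilization(trace)
print(peak, utilization)
```

For this assumed trace, allocating for the 128-register hotspot leaves most of the allocation idle for the bulk of execution, which is the inefficiency the paragraph above describes.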