Graphics processing units (GPUs) are highly threaded machines in which hundreds of threads of a program execute in parallel to achieve high throughput. In some implementations, threads are organized into collections (e.g., 7 or 8 threads per collection, depending on the GPU generation), and each collection is assigned to one of several Execution Units (EUs) within the GPU. An EU typically executes instructions from one active thread at a time. Whenever the currently active thread on an EU stalls, the GPU hardware switches to another thread that is ready to execute instructions and makes it the active thread. This process, referred to as thread switching, helps hide memory access latencies.
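The latency-hiding effect of thread switching can be illustrated with a small simulation. The following is a minimal sketch and not a model of any particular GPU: the function name `run_eu`, the two operation kinds ("alu" and "mem"), the one-cycle issue cost, and the fixed memory latency are all assumptions made for illustration.

```python
def run_eu(threads, mem_latency, switch_on_stall=True):
    """Toy model of one EU. Returns the total cycle count.

    threads: list of per-thread instruction lists, each entry "alu" or "mem".
    Assumptions (illustrative only): every instruction takes one cycle to
    issue, and a thread that issues "mem" cannot issue again until
    mem_latency cycles later.
    """
    n = len(threads)
    pc = [0] * n         # next instruction index for each thread
    ready_at = [0] * n   # cycle at which each thread may issue again
    cycle = 0
    active = 0
    while True:
        unfinished = [t for t in range(n) if pc[t] < len(threads[t])]
        if not unfinished:
            break
        if switch_on_stall:
            ready = [t for t in unfinished if ready_at[t] <= cycle]
            if not ready:
                # Every remaining thread is stalled: the EU idles.
                cycle = min(ready_at[t] for t in unfinished)
                continue
            if active not in ready:
                active = ready[0]  # thread switch
        else:
            # Baseline: run each thread to completion, waiting out stalls.
            active = unfinished[0]
            cycle = max(cycle, ready_at[active])
        op = threads[active][pc[active]]
        pc[active] += 1
        cycle += 1
        if op == "mem":
            ready_at[active] = cycle + mem_latency
    # Account for any memory access still outstanding at the end.
    return max([cycle] + ready_at)


threads = [["mem", "alu"]] * 4
with_switching = run_eu(threads, mem_latency=10, switch_on_stall=True)
without_switching = run_eu(threads, mem_latency=10, switch_on_stall=False)
```

Under these assumptions, four threads that each issue one memory access followed by one ALU operation finish in 15 cycles with switching and 48 cycles without it, because the EU overlaps one thread's memory wait with the other threads' instructions.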
Traditionally, GPU hardware provides a constant amount of register file storage per thread. Compilers use the register file size as part of a machine description structure, and perform optimizations and register sharing using this statically known, constant register file size for the target architecture. For instance, each thread may have 128 general purpose registers statically allocated in a General Purpose Register File (GRF). If a program's register usage exceeds 128 registers, the compiler generates spills by assigning slots in memory as the home locations for variables that fail to get a register allocation. Thus, in a reduced instruction set computer (RISC) machine, the compiler inserts a memory load (a fill) just before a spilled variable is used and a memory write (a spill) just after it is written. The likelihood of stalls due to cache misses is high whenever fills are placed very close to their uses. Consequently, when a program contains many fills, little useful execution is performed and the machine spends more of its time stalled.
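The spill and fill insertion described above can be sketched as a simple pass over a list of instructions. This is a hedged illustration only: the tuple-based instruction format `(op, dest, srcs)`, the function name `insert_spill_code`, and the `slotN` naming of memory home locations are hypothetical and do not correspond to any real compiler's intermediate representation.

```python
def insert_spill_code(instrs, spilled):
    """Insert a fill before each use and a spill after each definition
    of every variable in `spilled`.

    instrs: list of (op, dest, srcs) tuples, where srcs is a tuple of names.
    spilled: set of variable names that did not get a register.
    """
    slots = {}  # variable -> memory slot index (its home location)
    out = []
    for op, dest, srcs in instrs:
        # Fill each spilled source once, immediately before the use.
        for s in dict.fromkeys(srcs):
            if s in spilled:
                slot = slots.setdefault(s, len(slots))
                out.append(("fill", s, (f"slot{slot}",)))
        out.append((op, dest, srcs))
        # Spill a spilled destination immediately after the write.
        if dest in spilled:
            slot = slots.setdefault(dest, len(slots))
            out.append(("spill", f"slot{slot}", (dest,)))
    return out


program = [
    ("mul", "a", ("x", "y")),
    ("add", "b", ("a", "x")),
    ("add", "c", ("a", "b")),
]
rewritten = insert_spill_code(program, spilled={"a"})
ops = [op for op, _, _ in rewritten]
```

In this sketch every fill lands directly in front of the instruction that consumes it, which is exactly the placement that leaves no independent work to overlap with a cache miss.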