Graphics processing units (GPUs) and other vector processors typically employ a plurality of compute units, each having one or more arithmetic logic units (ALUs), to execute a corresponding plurality of threads of a shader or other compute kernel in parallel. Each compute unit provides a set of physical general purpose registers (GPRs) that can be allocated to a thread for use during execution of that thread at the compute unit. However, each physical GPR implemented in a compute unit consumes a corresponding amount of power and die area. More complex shaders or kernels often require a large number of GPRs per thread, and thus providing enough physical GPRs to support such complex compute kernels can result in excessive power consumption and can require considerable die floorspace. Conversely, if fewer physical GPRs are implemented in view of power and floorspace limitations, the processor typically is correspondingly limited in the number of threads that can be executed in parallel, which can lead to relatively low ALU occupancy.
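The tradeoff described above can be illustrated with a minimal arithmetic sketch. Every name and number in it is hypothetical and chosen only for illustration (a register file of 1024 physical GPRs, a hardware limit of 64 resident threads per compute unit, and kernels needing 8 or 64 GPRs per thread); none of these values comes from any particular processor. The sketch also treats the fraction of filled thread slots as a rough proxy for ALU occupancy, which is a simplifying assumption.

```python
def max_resident_threads(physical_gprs: int, gprs_per_thread: int,
                         hw_thread_limit: int) -> int:
    """Threads that fit in one compute unit, bounded by GPRs and thread slots."""
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    # Each thread needs its own GPR allocation, so the register file caps
    # the thread count; the hardware thread-slot limit caps it as well.
    return min(hw_thread_limit, physical_gprs // gprs_per_thread)


def thread_occupancy(physical_gprs: int, gprs_per_thread: int,
                     hw_thread_limit: int) -> float:
    """Fraction of thread slots filled (a rough proxy for ALU occupancy)."""
    resident = max_resident_threads(physical_gprs, gprs_per_thread,
                                    hw_thread_limit)
    return resident / hw_thread_limit


# Hypothetical compute unit: 1024 physical GPRs, 64 thread slots.
simple_kernel = thread_occupancy(1024, gprs_per_thread=8, hw_thread_limit=64)
complex_kernel = thread_occupancy(1024, gprs_per_thread=64, hw_thread_limit=64)
# Same complex kernel on a register file half the size.
complex_small_rf = thread_occupancy(512, gprs_per_thread=64, hw_thread_limit=64)

print(simple_kernel, complex_kernel, complex_small_rf)
```

Under these assumed numbers, the simple kernel fills every thread slot, the complex kernel fills a quarter of them, and halving the register file halves that again, which is the low-occupancy outcome the paragraph describes.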