Graphics processing units (GPUs) and other multithreaded processing units typically implement multiple processing elements (which are also referred to as processor cores or compute units) that concurrently execute multiple instances of a single program on multiple data sets. The instances are referred to as threads or waves. Several waves are created (or spawned) and then dispatched to each processing element in a multithreaded processing unit. The processing unit can include hundreds of processing elements so that thousands of waves concurrently execute programs in the processing unit. The processing elements in a GPU typically process three-dimensional (3-D) graphics using a graphics pipeline formed of a sequence of programmable shaders and fixed-function hardware blocks. For example, a 3-D model of an object that is visible in a frame can be represented by a set of primitives such as triangles, other polygons, or patches, which are processed in the graphics pipeline to produce values of pixels for display to a user. In a multithreaded GPU, the waves execute different instances of the shaders to perform calculations on different primitives concurrently or in parallel. Waves that are executing concurrently in a multithreaded processing unit share some of the resources of the processing unit. Shared resources include vector general-purpose registers (VGPRs) that store state information for the waves, local data shares (LDSs) that are used to store data for the waves, bandwidth available to move information between a local cache hierarchy and a memory, and the like.
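One consequence of sharing VGPRs and LDS among concurrently executing waves can be illustrated with a simplified model. The following Python sketch is not taken from any particular hardware. The `ProcessingElement` class, the `max_concurrent_waves` function, the fixed per-wave allocation scheme, and all numeric budgets are hypothetical and chosen only for illustration. It assumes each wave reserves a fixed number of VGPRs and LDS bytes for its lifetime, and estimates how many waves a single processing element could hold at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingElement:
    """Hypothetical per-element resource budget (illustrative values only)."""
    total_vgprs: int     # VGPRs shared by all waves resident on the element
    lds_bytes: int       # local data share capacity, in bytes
    max_wave_slots: int  # assumed hardware cap on resident waves


def max_concurrent_waves(pe: ProcessingElement,
                         vgprs_per_wave: int,
                         lds_bytes_per_wave: int) -> int:
    """Estimate how many waves fit on one processing element.

    Assumes each wave statically reserves its VGPRs and LDS bytes, so the
    scarcest shared resource bounds the number of concurrent waves.
    """
    if vgprs_per_wave <= 0:
        raise ValueError("a wave must use at least one VGPR")
    if lds_bytes_per_wave < 0:
        raise ValueError("LDS usage cannot be negative")

    limits = [pe.max_wave_slots, pe.total_vgprs // vgprs_per_wave]
    if lds_bytes_per_wave > 0:
        # Waves that use no LDS are not limited by LDS capacity.
        limits.append(pe.lds_bytes // lds_bytes_per_wave)
    return min(limits)


# Hypothetical budget: 512 VGPRs, 64 KiB of LDS, 16 wave slots.
pe = ProcessingElement(total_vgprs=512, lds_bytes=65536, max_wave_slots=16)
light = max_concurrent_waves(pe, vgprs_per_wave=16, lds_bytes_per_wave=0)
heavy = max_concurrent_waves(pe, vgprs_per_wave=128, lds_bytes_per_wave=8192)
print(light, heavy)
```

Under these assumed numbers, a wave that uses few registers is bounded only by the wave-slot cap, whereas a register-heavy wave is bounded by the VGPR budget. Bandwidth between the cache hierarchy and memory is also shared, but it is not modeled here because it is not a statically partitioned quantity.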