Some conventional processors leverage massive multithreading to hide latency and achieve high performance. Regularly structured, compute-intensive applications can readily exploit the high peak memory bandwidth and ample computational resources of a graphics processing unit (GPU). In particular, regularly structured applications with high spatial and temporal locality can utilize cache resources efficiently. However, not all applications can be refactored to exhibit regular control flow and memory access patterns, and many emerging GPU applications utilize cache resources inefficiently. Specifically, applications can suffer from cache thrashing because a large thread count combined with small caches leaves limited cache capacity per thread.
When the massively multithreaded nature of GPUs is combined with irregular memory access patterns, little effective cache capacity may be available per thread, resulting in high cache miss rates and reducing the amount of temporal locality that can be exploited. Such behavior often results in low temporal and spatial reuse of cache blocks, and may waste memory bandwidth, on-chip storage, and dynamic random-access memory (DRAM) power. Thus, there is a need to address this issue and/or other issues associated with the prior art.
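By way of illustration only, the thrashing behavior described above can be sketched with a small simulation. The sketch below is not taken from any particular GPU; it assumes a hypothetical fully associative cache with least-recently-used (LRU) replacement, 128-byte cache lines, 4-byte accesses, 1024 threads, and a 16 KiB cache (about 16 bytes of capacity per thread). The function names `lru_hit_rate`, `regular_trace`, and `irregular_trace` are illustrative assumptions.

```python
from collections import OrderedDict
import random


def lru_hit_rate(addresses, num_lines, line_bytes=128):
    """Hit rate of a hypothetical fully associative LRU cache over a byte-address trace."""
    cache = OrderedDict()  # keys are line indices, ordered from least to most recently used
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_bytes
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark the line as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total if total else 0.0


def regular_trace(num_threads, accesses_per_thread, word_bytes=4):
    """Regular pattern: at each step, neighboring threads touch consecutive words."""
    for step in range(accesses_per_thread):
        for thread in range(num_threads):
            yield (step * num_threads + thread) * word_bytes


def irregular_trace(num_threads, accesses_per_thread, footprint_bytes,
                    seed=0, word_bytes=4):
    """Irregular pattern: each access lands on a pseudo-random word in a large footprint."""
    rng = random.Random(seed)  # fixed seed so the trace is reproducible
    words = footprint_bytes // word_bytes
    for _ in range(accesses_per_thread * num_threads):
        yield rng.randrange(words) * word_bytes


if __name__ == "__main__":
    threads, steps = 1024, 16          # assumed thread count and accesses per thread
    cache_lines = 128                  # 128 lines x 128 bytes = 16 KiB (assumed)
    footprint = 64 * 1024 * 1024       # 64 MiB working set for the irregular case (assumed)

    print("bytes of cache per thread:", cache_lines * 128 // threads)
    print("regular hit rate:  ", lru_hit_rate(regular_trace(threads, steps), cache_lines))
    print("irregular hit rate:",
          lru_hit_rate(irregular_trace(threads, steps, footprint), cache_lines))
```

Under these assumptions, the regular trace misses only on the first access to each 128-byte line and hits on the remaining 31 four-byte words, whereas the irregular trace rarely finds a line still resident, so nearly every access consumes a full cache line of memory bandwidth for a single word.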