Hardware prefetching is not commonly employed in Graphics Processing Units (GPUs), even though GPU memory accesses may exhibit strong spatial locality, because demand requests alone may saturate the available memory bandwidth of planar, two-dimensional (2D) memory. Prefetching may nevertheless be useful with, for example, die-stacked memory, which offers much higher memory bandwidth at a lower per-access energy cost than 2D memory.
GPUs consume a large amount of memory bandwidth (e.g., with demand requests), which may result in large queuing latency at the memory controllers and thereby degrade GPU performance. The introduction of stacked dynamic random access memory (DRAM) may address this issue and facilitate hardware prefetching for GPUs. However, because memory bandwidth remains a shared, finite resource even with stacked memory, it may be desirable to issue prefetches only when bandwidth is available or is not heavily utilized by demand requests.
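By way of illustration only, the idea of issuing prefetches only when demand traffic leaves bandwidth to spare may be sketched in software as follows. The sketch is a simplified behavioral model, not a description of any particular hardware. The class name `PrefetchThrottle`, the sliding-window measurement, and the parameters `window_cycles`, `peak_requests_per_cycle`, and `threshold` are all assumptions introduced for this example.

```python
from collections import deque


class PrefetchThrottle:
    """Hypothetical gate that admits prefetches only when demand traffic is light.

    Assumption: bandwidth utilization is estimated as the number of demand
    requests seen in a sliding window of cycles, divided by the most requests
    the memory could have served in a full window.
    """

    def __init__(self, window_cycles=100, peak_requests_per_cycle=4, threshold=0.5):
        self.window_cycles = window_cycles
        self.capacity = window_cycles * peak_requests_per_cycle
        self.threshold = threshold      # assumed utilization cutoff
        self.history = deque()          # demand requests per cycle, newest last
        self.in_window = 0              # running sum of self.history

    def record_cycle(self, demand_requests):
        """Log the demand requests issued this cycle and expire the oldest cycle."""
        self.history.append(demand_requests)
        self.in_window += demand_requests
        if len(self.history) > self.window_cycles:
            self.in_window -= self.history.popleft()

    def utilization(self):
        """Fraction of the window's peak capacity consumed by demand requests."""
        return self.in_window / self.capacity

    def may_prefetch(self):
        """Allow a prefetch only while demand utilization is below the cutoff."""
        return self.utilization() < self.threshold


throttle = PrefetchThrottle(window_cycles=4, peak_requests_per_cycle=2, threshold=0.5)

for demand in (2, 2, 2, 2):             # demand requests saturate the window
    throttle.record_cycle(demand)
busy = throttle.may_prefetch()          # False: prefetches are held back

for demand in (0, 0, 0, 0):             # demand traffic goes idle
    throttle.record_cycle(demand)
idle = throttle.may_prefetch()          # True: spare bandwidth is available
```

In this model the demand requests are never delayed; only the prefetches are gated. A real memory controller might instead observe queue occupancy or other signals, and the cutoff would need to be chosen for the memory system in question.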