This disclosure relates to the field of data processing systems.
Modern processors, such as graphics processing units (GPUs), use a high degree of multithreading to overlap memory accesses with computation. Under the Single Instruction Multiple Thread (SIMT) model, GPUs group many threads that perform the same operations on different data into warps (groups of threads), and a warp scheduler attempts to swap out warps that are waiting on memory accesses in favor of warps that are ready for computation. However, programs frequently lack enough computation to hide the long latency of off-chip memory accesses, which is one of the challenges that prevent these architectures from achieving peak performance. When a memory-intensive workload needs to gather more data from DRAM for its computations, the maximum number of outstanding requests that can be handled on-chip (including at the L1, L2, and memory subsystem buffers) is reached and these resources become saturated. Due to this saturation, a new memory request can be sent only when an older memory request completes, so subsequent memory accesses can no longer be pipelined and are instead serialized. In such a scenario, the computation portion of any of the parallel warps cannot begin, because all the memory requests needed to initiate the computation have been serialized. Therefore, the amount of computation is not sufficient to hide the latency of the unpipelined memory requests still to be filled, and the workloads cannot achieve high throughput.
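The effect of a bounded number of outstanding requests can be illustrated with a minimal sketch. The function name `completion_time`, its parameters, and the fixed per-request latency are assumptions made for illustration only; they do not describe any particular GPU or the disclosed system. The model issues one request per cycle until the limit on in-flight requests is reached, after which each new request must wait for the oldest one to complete.

```python
import heapq


def completion_time(num_requests, max_outstanding, latency, issue_interval=1):
    """Cycles until all requests have returned (illustrative model only).

    At most `max_outstanding` requests may be in flight at once; every
    request is assumed to take a fixed `latency` cycles to complete.
    """
    in_flight = []  # min-heap of completion times of in-flight requests
    now = 0
    for _ in range(num_requests):
        if len(in_flight) == max_outstanding:
            # Saturated: stall until the oldest outstanding request completes.
            now = max(now, heapq.heappop(in_flight))
        heapq.heappush(in_flight, now + latency)
        now += issue_interval
    return max(in_flight) if in_flight else 0


# With enough buffering, eight requests overlap almost entirely.
pipelined = completion_time(8, max_outstanding=8, latency=100)
# With a single slot, each request waits for the previous one to finish.
serialized = completion_time(8, max_outstanding=1, latency=100)
```

Under these assumed values, the pipelined case finishes shortly after a single request latency has elapsed, whereas the single-slot case takes the sum of all eight latencies, which mirrors the serialization described above.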
This problem of memory-intensive applications saturating memory subsystem resources is exacerbated by uncoalesced memory accesses and irregular access patterns. Such accesses increase cache thrashing, which forces more DRAM requests to be issued and thus worsens the serialization of memory accesses. While ideal GPU workloads tend to have very regular, streaming memory access patterns, recent research has examined GPU applications that benefit from cache locality and have more non-streaming accesses. If this data locality is not exploited, cache thrashing occurs and degrades performance.
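To make the cost of uncoalesced accesses concrete, the following sketch counts how many distinct cache lines a single warp touches for a given access stride. The helper name `lines_touched` and the default values (32 threads per warp, 4-byte elements, 128-byte cache lines) are illustrative assumptions, not parameters taken from this disclosure.

```python
def lines_touched(stride_elems, warp_size=32, elem_bytes=4, line_bytes=128):
    """Number of distinct cache lines one warp touches (illustrative only).

    Thread t is assumed to access the element at index t * stride_elems.
    """
    addresses = (t * stride_elems * elem_bytes for t in range(warp_size))
    return len({addr // line_bytes for addr in addresses})


coalesced = lines_touched(1)     # consecutive elements share one line
uncoalesced = lines_touched(32)  # every thread lands on a different line
```

Under these assumptions, a unit-stride warp needs a single cache line, while a stride of 32 elements needs one line per thread, so the same instruction generates many more memory requests and occupies correspondingly more of the limited outstanding-request capacity.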