Today's high-performance processors generally include multithreaded parallel processing cores that perform well on a wide range of computation-intensive applications. For example, a graphics processing unit (GPU) includes parallel computing hardware for graphics applications. Theoretically, GPU performance is a product of two factors: the number of floating-point units (FPUs) and the inherent parallelism present in the application. Major advances in semiconductor process technology, such as the continued miniaturization of CMOS devices, have produced faster and smaller transistors, enabling a massive number of FPUs in a single GPU. Further, this large number of FPUs has provided the software programmer with a substrate for rapidly solving complex problems that have considerable parallelism. These trends have significantly increased processor performance, enabling leaps in software functionality and making the processor a ubiquitous commodity.
Unfortunately, various factors can contribute to less-than-optimal performance of parallel computing devices, such as GPUs or general-purpose processors. One such factor is the design of the memory system, which may fall short of providing the bandwidth (data throughput) required to match the high computation needs of the processor. The conventional solution is to organize the memory system as multiple memory modules (banks) that can be accessed in parallel, i.e., interleaved memory. If the memory accesses are uniformly distributed among all the modules, then the full bandwidth of the memory system can be achieved, and the design problem reduces to simply increasing the number of memory modules to match the GPU requirements. On the other hand, if the accesses are not uniformly distributed, several requests compete for the same module while other modules sit idle, and this contention can significantly decrease performance.
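The effect of the access pattern on contention can be illustrated with a minimal sketch. It assumes the simplest form of interleaving, in which a word address is mapped to a bank by taking it modulo the number of banks; this mapping, the bank count of 8, and the names bank_of and max_bank_load are assumptions chosen for illustration and are not taken from any particular memory system.

```python
from collections import Counter

def bank_of(address, num_banks):
    # Assumed mapping (low-order interleaving): consecutive word
    # addresses fall into consecutive banks.
    return address % num_banks

def max_bank_load(stride, num_banks, num_accesses):
    # Generate a constant-stride stream of word addresses and count how
    # many of them land on the busiest bank. A result of 1 means every
    # access hits a different bank; larger values mean contention.
    addresses = [i * stride for i in range(num_accesses)]
    load = Counter(bank_of(a, num_banks) for a in addresses)
    return max(load.values())

NUM_BANKS = 8  # hypothetical bank count
for stride in (1, 2, 3, 8):
    worst = max_bank_load(stride, NUM_BANKS, NUM_BANKS)
    print(f"stride {stride}: busiest bank receives {worst} of {NUM_BANKS} accesses")
```

Under these assumptions, strides of 1 and 3 spread eight accesses over all eight banks, a stride of 2 uses only four banks, and a stride of 8 sends every access to the same bank, so the accesses are effectively serialized.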
Graphics and general-purpose scientific applications typically include computations such as matrix operations on dense or sparse matrices, interpolation, convolution, fast Fourier transforms, and table lookups. These applications tend to generate interleaved streams of accesses that contain either constant strides or a structured pattern of strides. Moreover, these applications can also generate unordered access patterns that appear random. Thus, there is a need for an interleaved memory system that avoids conflicts and is capable of providing high bandwidth across all of these access patterns.