The disclosed subject matter relates generally to computer systems and, more particularly, to a method and apparatus for batching memory requests.
Memory structures, or memory, such as Random Access Memories (RAMs), Static RAMs (SRAMs), Dynamic RAMs (DRAMs) and various levels of cache, have evolved to require increasingly fast and efficient accesses. As memory technologies have increased in speed and usage, management of memory devices has increased in complexity. Increased demands on system performance coupled with memory management complexity now require efficient, streamlined memory utilization.
As the number of cores continues to increase in modern chip multiprocessor (CMP) systems, the DRAM memory system is becoming a critical shared resource. Memory requests from multiple central processing unit (CPU) cores interfere with each other, and this inter-application interference is a significant impediment to individual application and overall system performance. Conventional memory controllers have attempted to address the problem by making the memory controller aware of application characteristics and appropriately prioritizing memory requests to improve system performance and fairness.
Recent computer systems present an additional challenge by introducing integrated graphics processing units (GPUs) on the same die with CPU cores. GPU applications typically demand significantly more memory bandwidth than CPU applications due to the GPU's capability of executing a large number of parallel threads. GPUs use single-instruction multiple-data (SIMD) pipelines to concurrently execute multiple threads, where a group of threads running the same instruction is called a wavefront or warp. When a wavefront stalls on a memory instruction, the GPU core hides this memory access latency by switching to another wavefront to avoid stalling the pipeline. Therefore, there can be thousands of outstanding memory requests from across all of the wavefronts. GPU memory traffic is thus fundamentally more intensive than CPU memory traffic, where each CPU application has a much smaller number of outstanding requests due to the sequential execution model of CPUs.
Previous memory scheduling research has focused on memory interference between applications in CPU-only scenarios. These past approaches are built around a single centralized request buffer at each memory controller (MC). The scheduling algorithm implemented in the MC analyzes the stream of requests in the centralized request buffer to determine application memory characteristics, decides on a priority for each core, and then enforces these priorities. Observable memory characteristics may include the number of requests that result in row-buffer hits, the bank-level parallelism of each core, memory request rates, overall fairness metrics, and other information.
FIG. 1 illustrates memory request scheduling for a request buffer that is shared between a CPU core and a GPU. A conventional structure of the memory scheduler in a memory controller contains a request queue, which stores a list of requests from various hosts sharing the memory. The memory scheduler selects the “best” memory request to service, depending on the memory scheduler algorithm. For example, in FIG. 1, a CPU queue 100 includes three requests A, B and C from the CPU, all going to the same page/row. For purposes of illustration, assume there is a time interval between the requests. A GPU queue 110 includes requests W, X, Y and Z that are directed to the same page/row as each other but a different page/row than that of requests A, B and C. Assuming that the current open page is at the same page/row as request A, the memory scheduler will service request A first in the memory controller queue 120, as it is a row hit, which takes less time to process. The memory scheduler then services request W, which will change the current open page to page W and incur a row miss (represented by the shaded block for request W). Requests X, Y and Z are then serviced, since they all hit in the same (now open) row as request W. While this increases the total number of row-buffer hits in the system, it significantly delays the service of request B (which now also incurs a row-buffer miss). Overall, both the CPU and the GPU suffer significant slowdowns compared to a case when they run by themselves without any interference.
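The scheduling order described for FIG. 1 can be reproduced with a minimal sketch of a row-hit-first scheduler. The sketch is illustrative only and is not taken from the disclosure: the names `Request` and `schedule_row_hit_first`, the row numbers, the arrival times, and the hit and miss latencies of 1 and 3 cycles are all assumptions chosen to mirror the example.

```python
from dataclasses import dataclass

HIT_CYCLES = 1   # assumed cost of a row-buffer hit (illustrative value)
MISS_CYCLES = 3  # assumed cost of a row-buffer miss (illustrative value)


@dataclass(frozen=True)
class Request:
    name: str
    source: str   # "CPU" or "GPU"
    row: int      # DRAM page/row targeted by the request
    arrival: int  # cycle at which the request enters the shared buffer


def schedule_row_hit_first(requests, open_row):
    """Service requests one at a time, preferring row hits, then the oldest.

    Returns a list of (name, "hit" or "miss", completion_cycle) tuples.
    """
    pending = sorted(requests, key=lambda r: r.arrival)  # stable for ties
    now = 0
    log = []
    while pending:
        ready = [r for r in pending if r.arrival <= now]
        if not ready:
            # Nothing has arrived yet: idle until the next arrival.
            now = min(r.arrival for r in pending)
            continue
        # Row hits sort before misses; ties are broken by arrival time.
        chosen = min(ready, key=lambda r: (r.row != open_row, r.arrival))
        hit = chosen.row == open_row
        now += HIT_CYCLES if hit else MISS_CYCLES
        open_row = chosen.row
        pending.remove(chosen)
        log.append((chosen.name, "hit" if hit else "miss", now))
    return log


# CPU requests A, B, C target row 0 and arrive with a gap between them.
cpu = [Request("A", "CPU", 0, 0), Request("B", "CPU", 0, 2),
       Request("C", "CPU", 0, 4)]
# GPU requests W, X, Y, Z target row 1 and are all present at the start.
gpu = [Request(n, "GPU", 1, 0) for n in "WXYZ"]

shared_log = schedule_row_hit_first(cpu + gpu, open_row=0)
alone_log = schedule_row_hit_first(cpu, open_row=0)
print([name for name, _, _ in shared_log])
print(dict((name, done) for name, _, done in shared_log)["B"],
      dict((name, done) for name, _, done in alone_log)["B"])
```

Under these assumed latencies, the shared run services the requests in the order A, W, X, Y, Z, B, C, and request B completes at cycle 10 rather than at cycle 3 when the CPU runs alone, which is the delay the example describes.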
The large volume of requests from the GPU occupies a significant fraction of the request buffer, thereby limiting the visibility of the CPU applications' memory behaviors. One possible scenario is when the memory channel is shared by several CPUs, some of which are memory intensive and some of which are not, and the memory channel is also shared with the GPU. In this scenario, the GPU and the memory-intensive applications from the CPU will dispatch many memory requests to the memory scheduler. However, these requests generally have more tolerance to memory latency because, even after one request is serviced, other outstanding requests still halt the progress of the application. In contrast, the applications that are not memory intensive, which are sensitive to any extra memory latency, will not be able to inject their requests into the request queue. From the memory scheduler's perspective, there are few requests from the CPU in the request buffers, while most of the entries are from the GPU. As a result, the memory scheduler has little ability to select the best requests from the pool of CPU requests and quickly service the low-intensity CPU requests, which increases the slowdown of the system. This effect results in significant performance degradation for applications that are not memory intensive.
To allow the memory scheduler to schedule these requests effectively, the size of the request queue needs to be significantly larger. The increased request buffer size allows the MC to observe more requests from the CPUs to better characterize their memory behavior. For instance, with a large request buffer, the MC can identify and service multiple requests from one CPU core to the same row such that they become row-buffer hits. With a small request buffer, however, the MC may not even see these requests at the same time because the GPU's requests have occupied the majority of the entries. Very large request buffers impose significant implementation challenges, including the die area for the larger structures and the additional circuit complexity for analyzing so many requests, along with the logic needed for assignment and enforcement of priorities. Building a very large, centralized MC request buffer is unattractive due to the resulting area, power, timing and complexity costs.
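The visibility problem can be illustrated with a deliberately simple sketch. It assumes a hypothetical arrival pattern in which a fixed number of GPU requests precedes each CPU request and a shared buffer that simply holds the earliest arrivals; the function name `visible_cpu_requests` and all of the numbers are assumptions for illustration, not part of the disclosure.

```python
def visible_cpu_requests(buffer_size, gpu_per_cpu, cpu_total):
    """Count the CPU requests resident in a shared buffer filled in arrival order.

    Assumed arrival pattern: `gpu_per_cpu` GPU requests precede each CPU request.
    """
    stream = []
    for _ in range(cpu_total):
        stream.extend(["GPU"] * gpu_per_cpu)
        stream.append("CPU")
    resident = stream[:buffer_size]  # the buffer holds only the earliest arrivals
    return resident.count("CPU")


# Eight CPU requests to the same row, each preceded by seven GPU requests.
small = visible_cpu_requests(buffer_size=16, gpu_per_cpu=7, cpu_total=8)
large = visible_cpu_requests(buffer_size=64, gpu_per_cpu=7, cpu_total=8)
print(small, large)
```

With these assumed numbers, the 16-entry buffer exposes only two of the eight CPU requests to the scheduler at once, while the 64-entry buffer exposes all eight, at the cost of a structure four times as large.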
This section of this document is intended to introduce various aspects of art that may be related to various aspects of the disclosed subject matter described and/or claimed below. This section provides background information to facilitate a better understanding of the various aspects of the disclosed subject matter. It should be understood that the statements in this section of this document are to be read in this light, and not as admissions of prior art. The disclosed subject matter is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth above.