Many conventional processors can simultaneously execute more than one thread on the same chip (e.g., chip-multiprocessors or multi-core processors, symmetric shared-memory multiprocessors, and simultaneous multithreading processors). In these systems, the memory system (e.g., DRAM) is shared among the threads concurrently executing on different processing cores or in different execution contexts. The memory controller receives requests from the different threads and attempts to schedule them. Current memory controllers try to schedule requests so as to maximize the data throughput obtained from the memory. However, blindly maximizing data throughput ignores the parallelism among the requests generated by each thread, and therefore ignores the latency that each individual thread experiences.
When a thread is executed in a conventional processor, the thread can generate multiple concurrent memory requests due to techniques such as out-of-order instruction execution, data prefetching, or run-ahead execution. If these requests are to different banks in the memory system, the requests can be serviced in parallel. When the concurrent memory requests from a thread are serviced in parallel, the associated memory access latencies overlap and, to a first approximation, the processor stalls only as long as it would for a single memory access. For example, if an access to a memory bank takes M cycles, five concurrent accesses to different memory banks can take M cycles (or not significantly more), since different memory banks can be accessed concurrently. In contrast, if the processor generates the five memory requests serially (issuing each request only after the previous one has completed), or if all five requests go to the same bank, then no parallelism is possible and the processor must wait 5*M cycles until all requests are serviced.
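The arithmetic in the example above can be sketched with a simplified latency model. The sketch assumes that every access costs a fixed M cycles, that a single bank services its requests one at a time, and that different banks operate fully in parallel; the function names `overlapped_stall` and `serial_stall` are hypothetical and are used only for illustration.

```python
from collections import Counter

def overlapped_stall(banks, m):
    """Cycles until all requests finish when they are issued concurrently.

    `banks` lists the target bank of each request. Requests to the same
    bank are serviced back to back, so the busiest bank sets the stall time.
    """
    if not banks:
        return 0
    return max(Counter(banks).values()) * m

def serial_stall(banks, m):
    """Cycles when each request is issued only after the previous one completes."""
    return len(banks) * m

M = 100  # assumed cost of one bank access, in cycles

print(overlapped_stall([0, 1, 2, 3, 4], M))  # five different banks -> 100 (M)
print(overlapped_stall([0, 0, 0, 0, 0], M))  # one bank only        -> 500 (5*M)
print(serial_stall([0, 1, 2, 3, 4], M))      # serial issue         -> 500 (5*M)
```

Under these assumptions, the stall time is set by the most heavily loaded bank rather than by the total number of requests, which is why spreading concurrent requests across banks brings the stall close to M cycles.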
Accordingly, serializing memory requests in this way significantly reduces performance, since the processor takes longer to do the same amount of work. For this reason, conventional processors employ the sophisticated techniques mentioned above (e.g., out-of-order execution, run-ahead execution, and prefetching) to generate memory requests concurrently.
If a thread that generates concurrent requests is the only thread running in the system, existing memory controllers can schedule those requests efficiently in parallel. Unfortunately, if multiple threads are generating memory requests concurrently (e.g., in a multi-core processor system), the memory controller may schedule the outstanding requests according to a mainstream scheduling technique called FR-FCFS (first-ready, first-come-first-serve). FR-FCFS completely ignores the fact that servicing a thread's outstanding memory requests in parallel can result in a much shorter stall time at the application level than servicing those requests serially (one after another). Hence, a thread whose requests interfere with the requests of other threads can be significantly delayed in the memory system, because the memory controller may schedule that thread's memory accesses serially. Those same accesses would have been serviced in parallel had the thread been executed by itself or had the memory controller been aware of the parallelism among the thread's requests.
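The interference effect can be illustrated with a small sketch. This is a simplified model, not an implementation of FR-FCFS: it ignores row-buffer state (the "first-ready" part) and assumes a fixed cost of M cycles per access, with each bank servicing one request at a time. The thread names, the request orderings, and the function `thread_stall_times` are all hypothetical.

```python
def thread_stall_times(service_order, m):
    """Return the cycle at which each thread's last request completes.

    `service_order` is a list of (thread, bank) pairs in the order the
    controller selects them. Each bank services its requests one at a
    time at `m` cycles each, while different banks run in parallel.
    """
    bank_free_at = {}  # bank -> cycle at which the bank becomes idle
    stall = {}         # thread -> completion time of its slowest request
    for thread, bank in service_order:
        finish = bank_free_at.get(bank, 0) + m
        bank_free_at[bank] = finish
        stall[thread] = max(stall.get(thread, 0), finish)
    return stall

M = 100  # assumed cost of one bank access, in cycles

# Two threads, each with one request to bank 0 and one to bank 1.
# Parallelism-unaware order: each thread is first in one bank and second
# in the other, so both threads wait for two access times.
interleaved = [("T0", 0), ("T1", 1), ("T1", 0), ("T0", 1)]

# Order that keeps each thread's requests together across the banks.
grouped = [("T0", 0), ("T0", 1), ("T1", 0), ("T1", 1)]

print(thread_stall_times(interleaved, M))  # {'T0': 200, 'T1': 200}
print(thread_stall_times(grouped, M))      # {'T0': 100, 'T1': 200}
```

In this sketch both orderings keep both banks equally busy, so the total work is the same, yet the interleaved order makes both threads stall for 2*M cycles, while the grouped order lets one thread finish in M cycles without delaying the other.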