A current goal for designers of high-performance computing (“HPC”) systems is to reach exascale computing, that is, performance on the order of 10^18 floating-point operations per second (“exaFLOPS”). To achieve exascale computing, designers envision an exascale computing system with many nodes, each of which has many cores. The use of many cores per node allows for increased performance through parallelization. Unfortunately, many application programs are constrained by limited memory bandwidth, even on nodes with far fewer cores. As a result of the limited memory bandwidth, the memory read requests of the application programs are queued at the core, and the application programs stall while waiting for the queued read requests to be processed. One reason that the read requests are queued is that the cache into which the data is to be stored has no available outstanding request buffer (“ORB”). Whenever a memory request is to be sent to memory, an ORB is allocated to support issuing the memory request and receiving the corresponding response. If all the ORBs for a cache are allocated, subsequent memory requests must be queued until an ORB is deallocated.
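The allocate, queue, and deallocate behavior described above can be illustrated with a minimal sketch. The class name `OrbPool` and its methods are hypothetical and chosen only for illustration; an actual cache implements this logic in hardware, and the sketch ignores details such as request merging.

```python
from collections import deque


class OrbPool:
    """Toy model of the ORBs of one cache (hypothetical, for illustration)."""

    def __init__(self, num_orbs):
        self.free = num_orbs    # ORBs not currently allocated
        self.in_flight = set()  # addresses that currently hold an ORB
        self.pending = deque()  # requests queued because no ORB was free

    def request(self, address):
        """Issue a memory request; return True if sent, False if queued."""
        if self.free > 0:
            self.free -= 1
            self.in_flight.add(address)
            return True
        self.pending.append(address)
        return False

    def respond(self, address):
        """A response arrives: deallocate its ORB, then issue one queued request."""
        self.in_flight.remove(address)
        self.free += 1
        if self.pending:
            self.request(self.pending.popleft())
```

With a pool of two ORBs, for instance, a third request is queued and is issued only after one of the first two responses has been received.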
High-bandwidth memory (“HBM”) has the potential of allowing such application programs to execute without incurring significant delays due to stalling while waiting for queued memory read requests. HBM achieves higher bandwidth while using less power in a substantially smaller form factor than other memory technologies. The high bandwidth is achieved by stacking up to eight DRAM dies; the stack may also include a base die with a memory controller. The memory bus of an HBM stack is very wide in comparison to that of other DRAM memories. An HBM stack of four DRAM dies may have two 128-bit channels per die for a total of eight channels and a total width of 1024 bits. Examples of HBM include the High-Bandwidth Memory provided by Advanced Micro Devices, Inc. and the Hybrid Memory Cube provided by Micron Technology, Inc.
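The channel and bus-width arithmetic for the four-die example can be checked with a short sketch. The function name and its default parameters are hypothetical; they simply encode the figures given above (two channels per die, 128 bits per channel).

```python
def hbm_bus_width_bits(dram_dies, channels_per_die=2, bits_per_channel=128):
    """Return (total channels, total bus width in bits) for one HBM stack."""
    channels = dram_dies * channels_per_die
    return channels, channels * bits_per_channel


channels, width = hbm_bus_width_bits(4)
print(channels, width)  # 8 1024
```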
Unfortunately, even with HBM, significant queuing delays can still occur because application programs that execute on HPC systems commonly execute a large number of vector or single-instruction-multiple-data (“SIMD”) instructions. These queuing delays can be very significant with certain memory access patterns, such as a gather operation in which consecutive elements of the vector or array are not consecutive in memory. Although HBM is theoretically capable of supporting such application programs without significant queuing delays, the number of ORBs associated with a cache can present a bottleneck, resulting in significant queuing delays. For example, if an application program has a memory access pattern that is optimally supported by 24 ORBs, but the cache has only 10 ORBs, significant queuing delays can occur. Current computer architectures typically have 10 ORBs for an L1 cache. Even application programs that do not perform gather operations may still incur significant queuing delays because of an insufficient number of ORBs, since a vectorized loop may have many array references. Moreover, as cores support larger vector widths (e.g., 2048 bits) and support simultaneous multithreading (“SMT”) (e.g., 4-way), the number of ORBs will continue to be a limiting factor.
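The effect of having 10 ORBs when 24 would be optimal can be illustrated with a small timing model. This is only a sketch under stated assumptions: the function `cycles_to_complete` is hypothetical, at most one request issues per cycle, and every request returns after a fixed latency. A latency of 24 cycles is chosen purely so that 24 outstanding requests are needed to keep issuing every cycle.

```python
import heapq


def cycles_to_complete(num_requests, num_orbs, latency):
    """Cycle at which the last response returns, with at most num_orbs outstanding.

    Assumes one issue per cycle and a fixed latency (simplifications).
    """
    completions = []  # min-heap of completion times of in-flight requests
    now = 0
    for _ in range(num_requests):
        if len(completions) == num_orbs:
            # All ORBs are allocated: stall until the earliest response frees one.
            now = max(now, heapq.heappop(completions))
        heapq.heappush(completions, now + latency)
        now += 1  # the next request issues no earlier than the next cycle
    return max(completions, default=0)


for orbs in (10, 24):
    print(orbs, cycles_to_complete(240, orbs, latency=24))
```

Under these assumptions, 24 ORBs allow a request to issue every cycle, whereas with 10 ORBs the core stalls after every tenth request until a response returns, so the same 240 requests take more than twice as long.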
One solution would be to add more ORBs for the cache. Unfortunately, ORBs, especially for an L1 cache, are expensive both in terms of area, since they are close to the core, and in terms of power, since each cache miss initiates a fully associative lookup encompassing all the ORBs for a matching address. In summary, while HBM will support significantly increased memory parallelism, current cores are unprepared to support such memory parallelism.