A current goal for designers of high-performance computing (“HPC”) systems is to reach exascale computing, that is, performance on the order of 10^18 floating-point operations per second (“exaFLOPS”). To achieve exascale computing, designers envision an exascale computing system with many nodes, each of which has many cores. The use of many cores per node allows for increased performance through parallelization. Unfortunately, many application programs are constrained by limited memory bandwidth, even on nodes with far fewer cores. As a result of the limited memory bandwidth, the memory read requests of the application programs are queued at the core, and the application programs stall while waiting for the queued read requests to be processed.
High-bandwidth memory (“HBM”) has the potential of allowing such application programs to execute without incurring significant delays due to stalling while waiting for queued memory read requests. HBM currently provides up to five times the memory bandwidth of low-bandwidth memory (“LBM”), such as double data rate fourth generation (“DDR4”) memory. HBM achieves the higher bandwidth while using less power in a substantially smaller form factor than other memory technologies. The higher bandwidth may be achieved by stacking up to eight dynamic random access memory (“DRAM”) dies, which may include a base die with a memory controller. The memory bus of an HBM memory is very wide in comparison to that of other DRAM. An HBM stack of four DRAM dies may have two 128-bit channels per die for a total of eight channels and a total width of 1024 bits. Examples of HBM include the High-Bandwidth Memory provided by Advanced Micro Devices, Inc., the Hybrid Memory Cube provided by Micron Technology, Inc., and the Multi-Channel DRAM provided by Intel Corp.
For cost reasons, some computer architectures provide a large amount of LBM and a much smaller amount of HBM. The computer architectures may support different memory modes: cache mode, flat mode, and hybrid mode. With cache mode, the HBM operates as a cache memory. With flat mode, the physical address space of memory includes both LBM and HBM. With hybrid mode, a portion of the HBM operates as cache memory, and the remainder of the HBM is part of the physical address space of memory along with the LBM.
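As a minimal illustrative sketch of the three memory modes described above, the following function computes how much HBM acts as cache and how large the physical address space is in each mode. The function name, the mode strings, and the `hybrid_cache_fraction` parameter are hypothetical and are not taken from any particular architecture; real systems typically offer only a fixed set of hybrid splits.

```python
GIB = 2**30

def memory_layout(mode, lbm_bytes, hbm_bytes, hybrid_cache_fraction=0.5):
    """Return the cache size and physical address space size for a memory mode.

    mode: "cache", "flat", or "hybrid" (hypothetical labels for this sketch).
    hybrid_cache_fraction: assumed share of HBM used as cache in hybrid mode.
    """
    if mode == "cache":
        cache_bytes = hbm_bytes            # all HBM operates as cache
    elif mode == "flat":
        cache_bytes = 0                    # all HBM is addressable memory
    elif mode == "hybrid":
        cache_bytes = int(hbm_bytes * hybrid_cache_fraction)
    else:
        raise ValueError(f"unknown memory mode: {mode!r}")
    # HBM that is not used as cache joins the LBM in the physical address space.
    addressable_bytes = lbm_bytes + (hbm_bytes - cache_bytes)
    return {"cache_bytes": cache_bytes, "addressable_bytes": addressable_bytes}

# Example sizes are assumptions chosen only for illustration.
layouts = {m: memory_layout(m, 96 * GIB, 16 * GIB) for m in ("cache", "flat", "hybrid")}
```

With the assumed sizes, only the flat and hybrid layouts place any HBM in the physical address space, which is the situation in which the placement of data structures matters.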
When the physical address space of memory includes HBM (i.e., the flat mode and the hybrid mode), the allocation of the data structures of a program between HBM and LBM can influence the execution performance of the program. As an example, a program may have a first data structure with an access pattern such that each element of the first data structure is written only once and never read, and a second data structure (of the same size) with an access pattern such that each element is read many times. In such a case, the performance of the program would likely suffer if the first data structure were allocated in HBM and the second data structure were allocated in LBM. Performance of the program may be improved significantly by storing as much of the second data structure as possible in the HBM. In general, the data structures that consume the most off-chip bandwidth (e.g., memory requests sent from the processor to memory per time interval) are likely candidates for allocation in HBM. Unfortunately, the identification of such candidates can be very difficult, even for an expert programmer. The difficulty arises, in part, because the identification depends significantly on both compiler optimizations and the implementation of the host hardware. For example, compiler optimizations such as automatic vectorization, and hardware features such as out-of-order execution and prefetching, can significantly alter the memory access pattern of a target loop or region of a program that accesses a data structure.
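The placement trade-off in the example above can be sketched as a simple greedy heuristic that ranks data structures by memory requests per byte and fills the HBM in that order. This is only one possible heuristic under stated assumptions, not a description of any particular technique: the `DataStructure` type, the `plan_hbm_placement` function, and the request counts are hypothetical, and the counts are assumed to come from some profile of the running program.

```python
from dataclasses import dataclass

GIB = 2**30

@dataclass
class DataStructure:
    name: str
    size_bytes: int        # assumed to be positive
    memory_requests: int   # hypothetical profiled count of off-chip requests

def plan_hbm_placement(structures, hbm_capacity_bytes):
    """Return {name: bytes placed in HBM}, filling HBM greedily by requests per byte."""
    remaining = hbm_capacity_bytes
    plan = {}
    ranked = sorted(structures,
                    key=lambda s: s.memory_requests / s.size_bytes,
                    reverse=True)
    for s in ranked:
        placed = min(s.size_bytes, remaining)  # a structure may be placed only partly
        plan[s.name] = placed
        remaining -= placed
    return plan

# Two same-size structures: one touched once per element, one read many times.
first = DataStructure("first", 8 * GIB, memory_requests=1 * 2**27)
second = DataStructure("second", 8 * GIB, memory_requests=64 * 2**27)
plan = plan_hbm_placement([first, second], hbm_capacity_bytes=6 * GIB)
```

With an assumed 6 GiB of addressable HBM, the sketch places as much of the second data structure as fits in HBM and leaves the first entirely in LBM. The hard part in practice is obtaining trustworthy request counts, because compiler optimizations and hardware features change the access pattern actually seen by memory.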