A typical high-performance computing (HPC) system, sometimes referred to as a “supercomputer”, has a large number of nodes that cooperate to perform computations by sharing memory and communicating data with each other. Each node has specialized hardware with capabilities similar to those of a high-end server computer. HPC systems achieve computational performance much greater than that of retail devices such as desktop computers. Performance may be increased in one of two general ways: by improving the technology of the underlying hardware (scaling “up”), or by increasing the quantity of hardware, such as the number of processors or the amount of memory (scaling “out”). Most HPC systems balance the two types of scaling to achieve maximum performance for the least price. Whatever the balance may be, for ease of system administration the majority of nodes in an HPC system usually have identical hardware and configuration, although a small number of nodes may differ according to the needs of the system users.
One architectural goal in such a shared-memory system is to provide uniform latencies (delays) when one node accesses the memory of another, remote node (e.g., for storing data or for reading stored data). That is, the goal is to design the system such that each access to remote memory takes about the same amount of time. When the distribution of memory access latencies is narrow, different messages between nodes can share resources without significant performance impact. Uniform memory access latencies can be achieved, while still scaling the system out to a reasonable amount of memory, by installing in each node identical dual in-line memory modules (DIMMs), each capable of storing the same amount of dynamic random-access memory (DRAM).
A DRAM can be thought of as a rectangular array having many rows and columns of words, each word being a fixed number of bits (e.g., 32 or 64) and having a row address and a column address. To access a word in memory, a computing device first presents to the DRAM a signal encoding the row address, which activates all bits on that row for reading and writing; this row activation incurs a first latency. Next, the computing device presents to the DRAM a signal encoding the column address, which connects the correct, already-active bits of the word to the output; this column selection incurs a second latency. Once a row is active, multiple words in different columns may be read without incurring an additional row activation latency, so accessing the first word in a new row is typically slower than accessing the second and subsequent words in the same row. For example, a typical memory module (DDR4-4000 SDRAM) has a latency of about 9.5 nanoseconds (ns) from receiving a read command to presenting a first word at its output, but a latency of only about 0.25 ns to present each successive word from the same row. Expressed as a rate, 0.25 ns per word means that, once a row is active, this module can perform 4000 “megatransfers” (i.e., 4000 million transfers) per second, as its name suggests.
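The timing asymmetry described above can be sketched with a short calculation using the figures quoted for the DDR4-4000 example (about 9.5 ns to the first word of a newly activated row, about 0.25 ns for each subsequent word from the same row). The function below is purely illustrative, not part of any real memory-controller interface.

```python
# Illustrative DRAM access-timing sketch, using the DDR4-4000 figures
# quoted in the text. Values are approximate and for illustration only.

FIRST_WORD_NS = 9.5   # row activation + column select (first access to a new row)
NEXT_WORD_NS = 0.25   # column select only (row already active)

def burst_read_latency_ns(num_words: int) -> float:
    """Total latency to stream `num_words` consecutive words from one row."""
    if num_words <= 0:
        return 0.0
    return FIRST_WORD_NS + (num_words - 1) * NEXT_WORD_NS

# Reading 8 consecutive words from one active row is far cheaper than
# 8 independent accesses that each open a new row:
print(burst_read_latency_ns(8))      # 11.25 ns (9.5 + 7 * 0.25)
print(8 * burst_read_latency_ns(1))  # 76.0 ns  (8 * 9.5)
```

The contrast (11.25 ns versus 76.0 ns for the same eight words) is why memory controllers try to group accesses to an already-active row.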
Uniform memory access times cannot be achieved when the DRAM modules installed in each node have different access latencies. In particular, some HPC system applications benefit from scaling up some of the nodes to include DIMMs that provide non-volatile storage (NVDIMMs) in addition to the volatile DIMMs. When the power to an NVDIMM is disconnected, stored data are retained, and are again accessible when power is restored. Such non-volatile DIMMs may prevent a loss of data due to an unexpected power outage (a loss that would otherwise require further operation of the HPC system to recover), and may facilitate recovery from a system crash, among other applications. However, a typical NVDIMM may have an access latency three to nine times greater than that of a conventional DRAM DIMM. Thus, if a volatile DIMM and a non-volatile DIMM are both present in a remote node, addressable using different memory address ranges, then the latency experienced by a node attempting to access the memory of the remote node will vary drastically as a function of which memory address is requested.
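The address-dependent latency variation can be made concrete with a small sketch. The address ranges, the 6x NVDIMM penalty (a mid-range value within the three-to-nine-times figure above), and the region table are all hypothetical values chosen for illustration, not taken from a real system.

```python
# Hypothetical sketch: remote-access latency as a function of requested
# address, assuming (as in the text) a volatile DIMM and an NVDIMM mapped
# to disjoint address ranges. All ranges and multipliers are made up.

DRAM_LATENCY_NS = 9.5  # baseline volatile-DIMM access latency (illustrative)

# (start, end, multiplier): one fast volatile region, one slow NVDIMM
# region assumed 6x slower (mid-range of the 3-9x figure in the text).
REGIONS = [
    (0x0_0000_0000, 0x1_0000_0000, 1.0),  # volatile DIMM
    (0x1_0000_0000, 0x2_0000_0000, 6.0),  # NVDIMM
]

def access_latency_ns(addr: int) -> float:
    """Latency seen by a remote node, depending on which region is hit."""
    for start, end, mult in REGIONS:
        if start <= addr < end:
            return DRAM_LATENCY_NS * mult
    raise ValueError("address not mapped")

print(access_latency_ns(0x0800_0000))    # 9.5 ns  (volatile region)
print(access_latency_ns(0x1_8000_0000))  # 57.0 ns (non-volatile region)
```

Two requests to the same remote node thus differ in latency by a factor of six purely because of the address requested, which is exactly the non-uniformity the text describes.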
Accessing a slower memory (like an NVDIMM) occupies limited resources for longer than accessing a faster memory (like an ordinary DIMM). When these resources are shared, their exhaustion by accesses to slow memory can prevent uncorrelated accesses to fast memory that otherwise would have completed. In particular, in a computer system with shared memory that is accessed over a data connection between nodes, such as an HPC system, a slower memory access can tie up the connection's resources, reducing the speed of all memory accesses that use the connection.
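The resource-sharing effect above can be illustrated with a toy in-order model of a single shared channel: each request must wait for every request queued ahead of it, so one slow NVDIMM access delays an unrelated fast access behind it. The service times reuse the illustrative figures from the earlier examples and are not measurements.

```python
# Toy head-of-line-blocking sketch for a single shared connection that
# serves memory requests in order. Service times are illustrative only.

FAST_NS = 9.5   # ordinary volatile-DIMM access (figure quoted in the text)
SLOW_NS = 57.0  # NVDIMM access, assumed 6x slower (within the 3-9x range)

def completion_times(service_times):
    """In-order service on one channel: each request waits for all before it."""
    t, finished = 0.0, []
    for service in service_times:
        t += service
        finished.append(t)
    return finished

# A fast access queued behind a slow one finishes at 66.5 ns instead of 9.5 ns:
print(completion_times([SLOW_NS, FAST_NS]))  # [57.0, 66.5]
print(completion_times([FAST_NS, SLOW_NS]))  # [9.5, 66.5]
```

In this model the fast request's latency grows sevenfold simply because the shared channel was occupied, matching the text's point that slow accesses degrade all traffic on the connection.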