High performance computing (“HPC”) or “supercomputer” systems are used to perform computations that require large quantities of computing resources. HPC systems may be used, for example, in weather forecasting and aerodynamic modeling, cryptography and code breaking, simulation of nuclear weapons testing or molecular dynamics, and ‘big data’ analytics. These applications may require large amounts of memory or data storage, and large numbers of (or extremely fast) memory accesses or computational operations. Often, these large amounts of memory or data storage are provided by networking many computers together. Some clustered HPC systems provide federated memory using non-uniform memory access (“NUMA”), which allows each node to access the memory of some or all of the other nodes.
There are two main paradigms used to design HPC systems: scale-out and scale-up, which roughly correspond to the ideas of ‘bigger’ and ‘better’. Scale-out systems are ‘bigger’, in the sense that they network many commodity computing devices (such as retail server computers) in a cluster. By contrast, scale-up systems are ‘better’, in the sense that they embody better, often cutting-edge technology: faster processors, faster memory, larger memory capacity, and so on.
As HPC systems scaled out, the computing resources required for the operating system kernel to intercede on behalf of the user application became a performance bottleneck. To combat this problem, remote direct memory access (“RDMA”) and a direct data placement (“DDP”) protocol were developed, allowing a user application to configure networking hardware to transfer data directly between application memory and remote nodes over the network interconnect, without kernel processing. Despite the development of many technologies to improve the HPC network interconnect, HPC system design still largely involves choosing between scale-out and scale-up based on the particular type of application. Paged applications are often cheaper to run on scale-out designs that do not require RDMA, while other applications work better with scale-up designs that use RDMA and cache lines.