High Performance Computing (HPC) systems, sometimes referred to as supercomputers, typically include a plurality of compute nodes each having one or more processing units or cores. More typically, such HPC systems include hundreds or even thousands of cores that may be distributed across a network or placed in close proximity to one another (e.g., in a computer cluster). Such HPC systems are used for a wide range of computationally intensive applications in various fields, including, without limitation, quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling, and physical simulations.
The multiple compute nodes of an HPC system typically operate independently and periodically output information in a burst mode. The faster the burst, the higher the performance of the application. The burst output is typically stored to an enterprise level storage architecture. Due to the independent operation of the compute nodes, data output to the storage architecture may encounter different levels of congestion. Such congestion may result in variability in the data transfer latency or speed of individual blocks, packets, or other data units. Should this variability exceed specified tolerances, a receiving device (e.g., the storage architecture) may experience slower performance or bottlenecking while receiving data. Stated otherwise, the independent operation of the compute nodes can result in jitter, which reduces the data transfer rate to the storage architecture.
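The effect of jitter on the transfer rate can be illustrated with a minimal sketch. It assumes a simplified model, not stated in the text above, in which a burst counts as complete only when the slowest compute node has finished writing. The function name, the per-node data size, and the latency values are hypothetical and chosen only for illustration.

```python
def effective_burst_rate(gb_per_node, node_latencies_s):
    """Aggregate transfer rate (GB/s) for one burst.

    Assumed model: every node writes gb_per_node gigabytes, and the
    burst is finished only when the slowest node has finished.
    """
    if not node_latencies_s:
        raise ValueError("at least one node latency is required")
    total_gb = gb_per_node * len(node_latencies_s)
    # The slowest node sets the completion time of the whole burst.
    return total_gb / max(node_latencies_s)


# Four hypothetical nodes, each writing 10 GB.
uniform = effective_burst_rate(10, [1.0, 1.0, 1.0, 1.0])   # no jitter
jittered = effective_burst_rate(10, [1.0, 1.0, 1.0, 2.0])  # one congested node
print(uniform, jittered)
```

Under this assumed model, a single node delayed by congestion halves the aggregate rate of the burst, even though the other three nodes were unaffected.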
One current storage approach for HPC systems is to provide a storage architecture (e.g., a parallel file system) that provides enough bandwidth to sustain a 100% duty cycle burst (i.e., a simultaneous burst of all compute nodes). This ensures that write bandwidth is available when the compute nodes are ready to dump accumulated computations. This approach provides a brute force solution by using hundreds of, for example, conventional 20 GB/s block storage machines behind a parallel file system. The number of storage machines required to provide the necessary bandwidth causes many infrastructure problems, including management logistics, mean time between failure (MTBF) issues, power infrastructure, and cabling.
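The sizing arithmetic behind this brute force approach can be sketched as follows. Only the 20 GB/s per-machine figure comes from the text above. The function name, the node count, and the per-node burst rate are hypothetical values assumed for illustration.

```python
import math


def machines_for_burst(node_count, per_node_burst_gb_s,
                       machine_gb_s=20.0, duty_cycle=1.0):
    """Number of storage machines needed to absorb a burst.

    duty_cycle is the fraction of compute nodes bursting at the same
    time; 1.0 corresponds to the 100% duty cycle case.
    """
    if machine_gb_s <= 0:
        raise ValueError("machine_gb_s must be positive")
    aggregate_gb_s = node_count * per_node_burst_gb_s * duty_cycle
    # Round up: a fraction of a machine still requires a whole machine.
    return math.ceil(aggregate_gb_s / machine_gb_s)


# Hypothetical system: 10,000 nodes, each bursting at 0.5 GB/s.
full = machines_for_burst(10_000, 0.5)                      # 100% duty cycle
quarter = machines_for_burst(10_000, 0.5, duty_cycle=0.25)  # 25% of nodes at once
print(full, quarter)  # 250 63
```

With these assumed numbers, designing for a simultaneous burst of all nodes requires 250 storage machines, whereas a 25% duty cycle would require 63. The gap between the two figures is the cost of provisioning for the worst case.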
While such an approach works, it is inefficient, and designing for a desired maximum bandwidth is typically cost-prohibitive. As a result, users typically purchase less storage bandwidth than the maximum would require and accept poorer performance.