Computing technology has advanced at a remarkable pace, with each successive generation of computing systems offering greater performance, functionality, and storage capacity, often at reduced cost. Despite these advances, however, many scientific and business applications still demand massive computing power, which can only be met by extremely high performance computing (HPC) systems. One particular type of computing system architecture that is often used in high performance applications is a parallel processing computing system.
Generally, a parallel processing computing system comprises a plurality of physical computing nodes and is configured with an HPC application environment, e.g., including a runtime environment that supports the execution of a parallel application across multiple physical computing nodes. Some parallel processing computing systems, which may also be referred to as massively parallel processing computing systems, may have hundreds or thousands of individual physical computing nodes, and provide supercomputer class performance. Each physical computing node is typically of relatively modest computing power and generally includes one or more processors and a set of dedicated memory devices, and is configured with an operating system instance (OSI), as well as components defining a software stack for the runtime environment. To execute a parallel application, a cluster consisting of multiple physical computing nodes is generally created, and one or more parallel jobs or tasks are executed within an OSI in each physical computing node using the runtime environment, such that jobs or tasks may be executed in parallel across all physical computing nodes in the cluster.
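The execution model described above can be illustrated with a minimal sketch, assuming a hypothetical runtime in which each worker process stands in for an operating system instance on one physical computing node; the function and parameter names below are illustrative only and do not correspond to any particular HPC runtime.

```python
from concurrent.futures import ProcessPoolExecutor

def run_task(task_id):
    # Stand-in for one parallel job or task executed within an OSI on
    # a physical computing node; here it just computes a placeholder result.
    return task_id, task_id * task_id

def run_parallel_job(num_tasks, num_nodes):
    # Dispatch the tasks across the worker pool (the "cluster") so that
    # they execute in parallel, then gather the per-task results.
    with ProcessPoolExecutor(max_workers=num_nodes) as cluster:
        results = dict(cluster.map(run_task, range(num_tasks)))
    return results

if __name__ == "__main__":
    print(run_parallel_job(num_tasks=8, num_nodes=4))
```

In a real parallel processing computing system the tasks would be distributed across separate physical computing nodes by the runtime environment rather than across local processes, but the division of one application into concurrently executing jobs or tasks is the same.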
Performance in parallel processing computing systems can be dependent upon the communication costs associated with communicating data between the components in such systems. Accessing a memory directly coupled to a processor in one physical computing node, for example, may be several orders of magnitude faster than accessing a memory on a different physical computing node, or accessing a larger network attached storage. In addition, retaining the data within a processor and/or directly coupled memory when a processor switches between different jobs or tasks can avoid having to reload the data. Accordingly, organizing the jobs or tasks executed in a parallel processing computing system to localize operations and data, and to minimize the latency associated with communicating data between components, can have an appreciable impact on performance.
Some parallel processing computing systems have attempted to reduce the communication costs of accessing data from network attached storage by utilizing a burst buffer data tier, consisting of a smaller but faster non-volatile storage, e.g., solid state drive (SSD)-based storage, between the network attached storage and the computing nodes. Burst buffers may be used, for example, to stage in input data to be used by a task or job prior to execution of the task or job, and then stage out output data generated by the task or job once execution is complete. In addition, burst buffers may be used to assist with overlapping tasks or jobs, such that while data is being staged for some jobs or tasks, other jobs or tasks that have already been staged can execute using the previously-staged data. In many instances this allows the apparent input/output (I/O) performance to approach SSD speeds, rather than network storage speeds. Burst buffers, however, are generally limited in storage capacity, and as a result, some jobs or tasks may be unable to be staged until other jobs or tasks are completed and staged out, thereby limiting the ability to overlap jobs or tasks due to burst buffer capacity constraints.
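The capacity constraint described above can be sketched with a simplified, hypothetical staging model: jobs are staged into the burst buffer in submission order, but a job whose data footprint does not fit in the remaining buffer capacity must wait until an earlier wave of jobs has completed and staged out. The function, job names, and footprint units below are illustrative assumptions, not taken from any actual burst buffer scheduler.

```python
def schedule_staging(jobs, buffer_capacity):
    """jobs: list of (name, data_footprint) pairs; returns staging waves.

    Each wave is a list of jobs whose data is resident in the burst
    buffer concurrently; a new wave begins only when the previous
    wave's data has been staged out, modeling the limit that buffer
    capacity places on how many jobs can overlap."""
    waves, current_wave, used = [], [], 0
    for name, footprint in jobs:
        if used + footprint > buffer_capacity and current_wave:
            # Buffer full: this job must wait for the current wave to
            # complete and stage out before it can be staged in.
            waves.append(current_wave)
            current_wave, used = [], 0
        current_wave.append(name)
        used += footprint
    if current_wave:
        waves.append(current_wave)
    return waves

# With a 100-unit burst buffer, job C cannot be staged until A and B
# have completed and staged out, so C cannot overlap with them.
print(schedule_staging([("A", 60), ("B", 30), ("C", 50)], 100))
# prints [['A', 'B'], ['C']]
```

A larger burst buffer, or jobs with smaller data footprints, would allow more jobs per wave and hence more overlap between staging and execution, which is precisely the trade-off the capacity constraint imposes.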