High-performance computing (HPC) has seen a substantial increase in usage and interest in recent years. Historically, HPC was generally associated with so-called “supercomputers.” Supercomputers were introduced in the 1960s, made initially and, for decades, primarily by Seymour Cray at Control Data Corporation (CDC), Cray Research, and subsequent companies bearing Cray's name or monogram. While the supercomputers of the 1970s used only a few processors, in the 1990s machines with thousands of processors began to appear, and more recently massively parallel supercomputers with hundreds of thousands of “off-the-shelf” processors have been implemented.
There are many types of HPC architectures, both implemented and research-oriented, along with various levels of scale and performance. However, a common thread is the interconnection of a large number of compute units (also referred to as compute entities herein), such as processors and/or processor cores, to cooperatively perform tasks in a parallel manner. Under recent System on a Chip (SoC) designs and proposals, dozens of processor cores or the like are implemented on a single SoC, using a 2-dimensional (2D) array, torus, ring, or other configuration. Additionally, researchers have proposed 3D SoCs under which hundreds or even thousands of processor cores are interconnected in a 3D array. Separate multicore processors and SoCs may also be closely spaced on server boards, which, in turn, are interconnected in communication via a backplane or the like. Another common approach is to interconnect compute units in racks of servers (e.g., blade servers and modules). IBM's Sequoia, alleged to have once been the world's fastest supercomputer, comprises 96 racks of server blades/modules totaling 1,572,864 cores, and consumes 7.9 megawatts when operating at peak performance.
HPC enables the workload for solving a complex job or task to be distributed across multiple compute entities using a parallel processing approach; this may entail use of thousands or even hundreds of thousands of entities. In view of the statistical distribution of entity failures, as the number of entities employed for an HPC job increases, the likelihood that an entity failure will occur during the HPC job increases rapidly; assuming independent failures, the probability that a job completes without any entity failure decreases exponentially with the number of entities. This failure scaling has become a hot issue among the HPC community, as well as commercial cloud service providers.
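By way of illustration only, the scaling of failure likelihood with entity count may be sketched as follows. The sketch assumes that entity failures are independent and that every entity has the same probability of failing during the job; neither assumption is stated above, and the function name and parameters are hypothetical.

```python
import math


def prob_any_failure(num_entities: int, p_fail: float) -> float:
    """Probability that at least one entity fails during a job.

    Assumes (for illustration) independent failures, each entity failing
    with the same probability p_fail over the duration of the job.
    """
    if num_entities < 0:
        raise ValueError("num_entities must be non-negative")
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if p_fail == 1.0:
        return 1.0 if num_entities > 0 else 0.0
    # 1 - (1 - p)^N, computed with log1p/expm1 so that very small
    # per-entity probabilities do not lose precision.
    return -math.expm1(num_entities * math.log1p(-p_fail))


if __name__ == "__main__":
    # Hypothetical per-entity failure probability of 1e-5 per job.
    for n in (1_000, 100_000, 1_000_000):
        print(n, prob_any_failure(n, 1e-5))
```

Under these assumptions the failure-free probability is (1 - p)^N, so even a very small per-entity failure probability makes at least one failure nearly certain once N reaches hundreds of thousands of entities.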
To address the possibility of entity failures, HPC jobs are performed in a manner that enables recovery from such failures without having to redo the job (or significant portions of the job). This is commonly done through a checkpoint-restart scheme. Under one conventional approach, checkpoints are taken periodically at frequent intervals (the time period between checkpoints is known as an epoch) and in a synchronized manner, wherein for each epoch processing on all entities in a checkpoint group is halted, a checkpoint operation is performed on each entity, and the entities are restarted. The granularity of the checkpoint groups is fairly coarse, and may involve hundreds or thousands of entities.
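The synchronized checkpoint-restart scheme described above may be sketched, in highly simplified form, as the following simulation. It is a toy model rather than an actual HPC implementation: each entity's work is reduced to incrementing a counter, the checkpoint is held in memory instead of non-volatile storage, and failures are injected at caller-specified steps. The function name and all parameters are hypothetical.

```python
def run_with_checkpoints(num_entities, total_steps, epoch_len, fail_at_steps=()):
    """Simulate synchronized checkpoint-restart for one checkpoint group.

    Returns (final_states, steps_executed); steps_executed includes any
    work that had to be redone after rolling back to a checkpoint.
    """
    if epoch_len <= 0:
        raise ValueError("epoch_len must be positive")
    states = [0] * num_entities          # per-entity job state
    checkpoint = (0, list(states))       # (step, snapshot of all entities)
    pending_failures = set(fail_at_steps)
    step = 0
    executed = 0
    while step < total_steps:
        if step in pending_failures:
            # An entity failed: roll the whole group back to the last
            # checkpoint. Each injected failure fires only once.
            pending_failures.discard(step)
            step, snapshot = checkpoint
            states = list(snapshot)
            continue
        # One unit of work on every entity in the group.
        states = [s + 1 for s in states]
        step += 1
        executed += 1
        if step % epoch_len == 0:
            # End of epoch: all entities are halted, state is saved,
            # and processing then resumes.
            checkpoint = (step, list(states))
    return states, executed


if __name__ == "__main__":
    print(run_with_checkpoints(4, 10, 4, fail_at_steps={6}))
```

In this model a failure at step 6 with an epoch length of 4 discards the two steps performed since the checkpoint at step 4, so only that portion of the job is repeated rather than the whole job.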
During each checkpoint, data is written to some form of non-volatile storage (e.g., a mass storage device or array of such devices accessed over a network). The data include both job processing state information and data produced as output via execution of software on each entity. This results in substantial storage consumption, and a significant percentage of overall processing bandwidth is effectively wasted. In some instances, the associated storage consumption and execution restrictions of this conventional checkpoint-restart strategy make the actual result less sustainable or even impractical.
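The cost described above may be estimated with a simple back-of-the-envelope model, sketched below. The model assumes that all entities are idle for a fixed time while each checkpoint is written and that every entity writes the same amount of data; the function name, parameters, and example figures are hypothetical and are not taken from any particular system.

```python
def checkpoint_overhead(epoch_s, checkpoint_s, num_entities, bytes_per_entity):
    """Estimate the cost of synchronized checkpointing.

    epoch_s: seconds of useful processing between checkpoints
    checkpoint_s: seconds all entities are halted per checkpoint
    Returns (wasted_fraction, bytes_per_checkpoint).
    """
    if epoch_s <= 0 or checkpoint_s < 0:
        raise ValueError("epoch_s must be positive and checkpoint_s non-negative")
    # Fraction of wall-clock time spent halted rather than computing.
    wasted_fraction = checkpoint_s / (epoch_s + checkpoint_s)
    # Data written to non-volatile storage at every checkpoint.
    bytes_per_checkpoint = num_entities * bytes_per_entity
    return wasted_fraction, bytes_per_checkpoint


if __name__ == "__main__":
    # Hypothetical figures: 9 minutes of work, 1 minute halted,
    # 1000 entities each writing 1 GiB per checkpoint.
    print(checkpoint_overhead(540, 60, 1000, 2**30))
```

With these illustrative inputs, one tenth of the wall-clock time is spent halted for checkpointing, and each checkpoint writes 1000 GiB to storage.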