The paradigm of parallel computing is nowadays based on large numbers of networked physical servers, which are adapted to commission, execute, and decommission virtual servers (or virtual machines) in parallel and independently of each other.
With the growing use of physical server components originally developed for consumer or embedded products, built-in functionality for fault detection and recovery is often lacking, ultimately resulting in significantly lower dependability of such physical servers in comparison with their predecessors.
In the presence of a more failure-prone physical server infrastructure, techniques such as checkpointing may be used to maintain the availability of the executed virtual machines or processes.
Checkpointing involves the proactive and frequent acquisition of snapshots (or checkpoints) of virtual machines or compute processes to enable their immediate continuation from the most recent snapshot upon a failure of the underlying physical servers.
Although checkpointing is a well-known technique, its use in parallel computing systems is subject to several limitations. For example, when large numbers of physical servers, each having multiple processor cores, are deployed, the sheer number of accruing checkpoints may require limiting the frequency at which checkpoints are acquired or renewed. A lower checkpoint frequency means that more work is lost and must be repeated after a failure, which results in a longer mean time to repair (MTTR), a crucial factor in the calculation of the system availability.
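The relation between checkpoint frequency, MTTR and availability can be illustrated with a minimal sketch. It uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR), and assumes, purely for illustration, that failures fall uniformly within a checkpoint interval, so that on average half an interval of work is lost. The function names and all numeric values are hypothetical and are not taken from the references.

```python
def availability(mtbf_s: float, mttr_s: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_s / (mtbf_s + mttr_s)


def mttr_with_checkpointing(restart_s: float, checkpoint_interval_s: float) -> float:
    """Illustrative MTTR model (an assumption, not from the source).

    Restart/restore time plus the expected rework, taken as half a
    checkpoint interval under uniformly distributed failure times.
    """
    return restart_s + checkpoint_interval_s / 2.0


# Hypothetical figures chosen only to show the trend.
MTBF_S = 30 * 24 * 3600.0  # one failure every 30 days
RESTART_S = 60.0           # time to restore and restart from a checkpoint

for interval_s in (60.0, 600.0, 3600.0):
    mttr = mttr_with_checkpointing(RESTART_S, interval_s)
    avail = availability(MTBF_S, mttr)
    print(f"interval={interval_s:6.0f} s  MTTR={mttr:6.0f} s  availability={avail:.6f}")
```

Under these assumptions, the availability decreases monotonically as the checkpoint interval grows, which is why a forced reduction of the checkpoint frequency is undesirable.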
The same may apply when block storage devices such as hard disk drives (HDDs), solid-state drives (SSDs) or flash memory are used to store the accruing checkpoints, because their considerable access latencies become a limiting factor. In this context, however, the processing of accruing checkpoints may also become the critical bottleneck.
In reference [1], a physical server infrastructure for parallel computing based on a microserver concept is described, wherein a microserver is defined to integrate all integrated digital circuits of a server motherboard in a single System on a Chip (SoC), excluding volatile random access memory (RAM), boot flash memory and power converters, but including the networking interface.
The use of compression techniques in virtual machine checkpointing is disclosed in reference [2], in particular in relation to a proposed technique, which takes advantage of checkpoint similarity.
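The general idea of exploiting checkpoint similarity can be sketched as follows. This is not the technique of reference [2]; it is a simple stand-in that XORs a checkpoint against its predecessor and compresses the result with zlib, on the assumption that both checkpoints have the same size and differ only in a few locations. The function names and the synthetic data are hypothetical.

```python
import os
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_delta(previous: bytes, current: bytes) -> bytes:
    """Compress `current` relative to `previous` (equal sizes assumed)."""
    if len(previous) != len(current):
        raise ValueError("checkpoints must have equal size in this sketch")
    # Unchanged regions XOR to zero bytes, which compress very well.
    return zlib.compress(xor_bytes(previous, current))


def decode_delta(previous: bytes, delta: bytes) -> bytes:
    """Reconstruct the current checkpoint from its predecessor and the delta."""
    return xor_bytes(previous, zlib.decompress(delta))


# Synthetic 64 KiB "checkpoint" with a small modified region.
previous_cp = os.urandom(64 * 1024)
modified = bytearray(previous_cp)
modified[1000:1016] = os.urandom(16)
current_cp = bytes(modified)

delta = encode_delta(previous_cp, current_cp)
full = zlib.compress(current_cp)
print(f"full: {len(full)} bytes, delta: {len(delta)} bytes")
```

In this synthetic case the delta is far smaller than the independently compressed checkpoint, because only the modified region carries information.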
In reference [3], checkpointing is applied to inter-process communication to enhance fault tolerance in distributed systems.
Phase-change memory (PCM) technology as a basis for high-performance SSDs is proposed in reference [4].
Accordingly, it is an aspect of the present invention to improve the storing of checkpoints.