Massively Parallel Processing (“MPP”) computer systems are becoming increasingly large. Such MPP computer systems commonly have 20,000+ sockets (sometimes with multiple processors per socket) that are connected via a high-speed network interconnect and that share a memory whose size may be measured in terabytes. To take advantage of the increased processing power of these MPP computer systems, increasingly complex application programs are being developed. These application programs may have tasks executing on thousands of processors simultaneously and may take many hours to complete their execution.
As the number of processors and the density of the components in the MPP computer system increase, and as the complexity of the application programs increases, the probability that a component will fail during execution of an application program also increases. The failure of even a single component during execution of an application program may result in complete failure of that execution, requiring the execution to be restarted from the beginning. Such a complete failure means that thousands of hours of processor execution are wasted. In addition, as the probability of a component failure increases, the likelihood that such an application program will execute successfully from beginning to end without any failure decreases.
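The relationship between component count and the likelihood of a failure-free run can be illustrated with a minimal sketch. It assumes, purely for illustration, that components fail independently at a constant rate, so that the probability of no failure during a run is exp(-N × rate × hours). The function name, the failure rate of one failure per million component-hours, and the 24-hour run length are hypothetical values chosen for the example and are not taken from any particular system.

```python
import math


def survival_probability(num_components: int,
                         failure_rate_per_hour: float,
                         run_hours: float) -> float:
    """Probability that no component fails during the run.

    Assumes independent components with a constant (exponential)
    failure rate; this is an illustrative model, not a measured one.
    """
    return math.exp(-num_components * failure_rate_per_hour * run_hours)


if __name__ == "__main__":
    # Hypothetical rate: one failure per million component-hours.
    for n in (1_000, 20_000, 100_000):
        p = survival_probability(n, failure_rate_per_hour=1e-6, run_hours=24.0)
        print(f"{n:>7} components: P(no failure in 24 h) = {p:.3f}")
```

Under these assumptions, the probability of completing a run without any failure falls off exponentially as either the number of components or the run time grows, which is why restart-from-the-beginning recovery becomes increasingly costly at scale.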
Some runtime systems and application programs help ensure that execution of the application programs continues in the face of component failures or resumes without having to be restarted at the beginning. Traditional strategies for providing application programs with such “fault tolerance” have several limitations. Some of these strategies, such as system-directed checkpoints, do not scale well and appear to be reaching their limits as the number of processors and the amount of memory continue to increase. Some strategies also impose significant burdens on the application programmer and incur significant computational overhead during execution.
It would be desirable to minimize the impact of component failures so that the likelihood that an application program will successfully execute without failure increases and the amount of wasted processor resources is minimized.