The present invention relates generally to checkpointing in computer systems; and, particularly, to checkpoints in applications running on high performance parallel computers.
To achieve high performance computing, multiple individual processors have been interconnected to form a multiprocessor computer system capable of parallel processing. Multiple processors can be placed on a single chip, or several chips—each containing one or more processors—become interconnected to form single- or multi-dimensional computing networks into a multiprocessor computer system, such as described in co-pending U.S. Patent Publication No. 2009/0006808 A1 corresponding to U.S. patent application Ser. No. 11/768,905, the whole contents and disclosure of which is incorporated by reference as if fully set forth herein, describing a massively parallel supercomputing system.
Some processors in a multiprocessor computer system, such as a massively parallel supercomputing system, typically implement some form of direct memory access (DMA) functionality that facilitates communication of messages within and among network nodes, each message including packets containing a payload, e.g., data or information, to and from a memory, e.g., a memory shared among one or more processing elements. Types of messages include user messages (applications) and system initiated (e.g., operating system) messages.
Generally, a uni- or multi-processor system communicates with a single DMA engine, typically having multi-channel capability, to initialize data transfer between the memory and a network device (or other I/O device).
Such a DMA engine may directly control transfer of long messages, which long messages are typically preceded by short protocol messages that are deposited into reception FIFOs on a receiving node (for example, at a compute node). Through these protocol messages, the sender compute node and receiver compute node agree on which injection counter and reception counter (not shown) identifications to use, and what the base offsets are for the messages being processed. The software is constructed so that the sender and receiver nodes agree to the counter ids and offsets without having to send such protocol messages.
In parallel computing system, such as BlueGene® (a trademark of International Business Machines Corporation, Armonk N.Y.), system messages are initiated by the operating system of a compute node. They could be messages communicated between the OS (kernel) on two different compute nodes, or they could be file I/O messages, e.g., such as when a compute node performs a “printf” function, which gets translated into one or more messages between the OS on a compute node OS and the OS on (one or more) I/O nodes of the parallel computing system. In highly parallel computing systems, a plurality of processing nodes may be interconnected to form a network, such as a Torus; or, alternately, may interface with an external communications network for transmitting or receiving messages, e.g., in the faun of packets.
As known, a checkpoint refers to a designated place in a program at which normal processing is interrupted specifically to preserve the status information, e.g., to allow resumption of processing at a later time. Checkpointing, is the process of saving the status information. While checkpointing in high performance parallel computing systems is available, generally, in such parallel computing systems, checkpoints are initiated by a user application or program running on a compute node that implements an explicit start checkpointing command, typically when there is no on-going user messaging activity. That is, in prior art user-initiated checkpointing, user code is engineered to take checkpoints at proper times, e.g., when network is empty, no user packets in transit, or MPI call is finished.
However, it is desirable to have the computing system initiate checkpoints, even in the presence of on-going messaging activity. Further, it must be ensured that all incomplete user messages at the time of the checkpoint be delivered in the correct order after the checkpoint. To further complicate matters, the system may need to use the same network as is used for transferring system messages.
It would thus be highly desirable to provide a system and method for checkpointing in parallel, or distributed or multiprocessor-based computer systems that enables system initiation of checkpointing, even in the presence of messaging, at arbitrary times and in a manner invisible to any running user program.