Massively parallel systems, such as supercomputing systems, use checkpoints to allow the system to recover from a failure. The system stores its state information as checkpoint data so that, if a failure occurs, the system can be restarted at the checkpoint by loading the checkpoint data. Checkpoints are needed in supercomputing systems because the systems are so large and application runtimes so long (often days or weeks) that restarting from the beginning of a process after an error would be unacceptable.
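The checkpoint and restart cycle described above can be illustrated with a minimal sketch. The function names, the state layout, and the simulated failure below are hypothetical and chosen only for illustration; they do not describe any particular system's implementation.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint.

    The rename means a failure during the write leaves the previous
    checkpoint intact (an assumption of this sketch, not of the source).
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the stored state, or the default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Hypothetical long-running job: sums the step indices 0..total_steps-1.

    The job resumes from the last checkpoint if one exists. fail_at
    simulates a failure just before the given step is executed.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

In this sketch, a job that fails after its most recent checkpoint repeats only the steps performed since that checkpoint, rather than restarting from step zero.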
Checkpoint data is usually stored on rotating magnetic media. Such media have relatively slow input/output bandwidth and are typically located across a network from the computing system. Accordingly, creating checkpoints, which involves storing the system's state information and can include modifying prior checkpoints, consumes a substantial amount of machine time, possibly as much as 25 percent in some cases. As computing systems become more sophisticated and expensive, spending this much time checkpointing is unacceptable.
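The relationship between storage bandwidth and checkpointing overhead can be approximated with simple arithmetic. The sketch below assumes that the application is fully blocked while a checkpoint is written; the state size, bandwidth, and interval values are hypothetical figures picked to reproduce the 25 percent case, not measurements from any system.

```python
def checkpoint_overhead(checkpoint_bytes, bandwidth_bytes_per_s, compute_interval_s):
    """Fraction of machine time spent writing checkpoints.

    Assumes (for illustration only) that computation stops during each
    checkpoint write and that one checkpoint follows each compute interval.
    """
    write_time_s = checkpoint_bytes / bandwidth_bytes_per_s
    return write_time_s / (write_time_s + compute_interval_s)


# Hypothetical example: 100 TB of state, 100 GB/s to remote storage,
# one checkpoint after every 3000 s of computation.
slow = checkpoint_overhead(1e14, 1e11, 3000)

# Same workload with a hypothetical checkpoint memory ten times faster.
fast = checkpoint_overhead(1e14, 1e12, 3000)
```

Under these assumed figures, `slow` evaluates to 0.25, while `fast` is roughly 0.03, which illustrates why the bandwidth of the checkpoint memory dominates the overhead.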
An example of a checkpointing system 10 is shown in FIG. 1. One or more processors, such as CPU 20, include or are in communication with a bus controller 22 and memory controller 24. The CPU 20, bus controller 22 and memory controller 24 can be on a same circuit board or package, and can even be fabricated on a same substrate. The memory controller 24 communicates over memory bus 40 with main memory devices 42, 43, 44, 45, located on a second package or substrate. These main memory devices store data used by the CPU 20 during normal operation of the system 10.
The bus controller 22 may communicate with other units over a communication interface 26. The bus controller 22 is also coupled to a peripheral bus 30. The peripheral bus 30 can be located on the same substrate as the CPU 20, bus controller 22 and memory controller 24. Input and output devices 32, 34 are coupled to the peripheral bus 30 for communication with the bus controller 22.
A storage controller 50 is also coupled to the peripheral bus 30. The storage controller 50 communicates over a network 52 with a remote controller 54 to reach the checkpoint memory contained in a storage system 56. Accordingly, checkpoint memory in the system 10 is physically remote (often feet or miles away) from the CPU 20 and connects through a network 52, such as a LAN. Access to the checkpoint memory is accordingly slow and cumbersome.
Checkpoints are also created in systems that process both classified and unclassified information. A checkpoint can be created before switching from a classified context to an unclassified context. Access to the checkpoint data is then disabled prior to switching to the unclassified context. The checkpoint data is later reloaded when classified processing resumes. A reverse procedure occurs when switching from unclassified to classified processing.
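The context-switching sequence above can be sketched as follows. The class and function names are hypothetical, the sketch models only the classified side of the switch, and clearing the working state before unclassified processing begins is an assumption of this illustration rather than something stated above.

```python
import copy


class CheckpointStore:
    """Hypothetical in-memory checkpoint store whose access can be disabled."""

    def __init__(self):
        self._data = None
        self._enabled = True

    def _require_enabled(self):
        if not self._enabled:
            raise PermissionError("checkpoint access is disabled")

    def save(self, state):
        self._require_enabled()
        self._data = copy.deepcopy(state)

    def load(self):
        self._require_enabled()
        return copy.deepcopy(self._data)

    def disable(self):
        self._enabled = False

    def enable(self):
        self._enabled = True


def switch_to_unclassified(working_state, classified_store):
    """Checkpoint the classified state, then disable access to it."""
    classified_store.save(working_state)
    classified_store.disable()
    # Assumption of this sketch: the working state is cleared so that no
    # classified data remains visible in the unclassified context.
    working_state.clear()


def switch_to_classified(working_state, classified_store):
    """Re-enable access and reload the classified checkpoint."""
    classified_store.enable()
    working_state.clear()
    working_state.update(classified_store.load())
```

A second store of the same kind could hold the unclassified state for the reverse procedure; it is omitted here to keep the sketch short.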
As processors become increasingly complex and fast, checkpointing will likely become desirable in servers and even personal computers.
Accordingly, there is a need for a checkpoint memory that can quickly store checkpoint data such that checkpointing does not consume an undue amount of computing time.