A common problem in computer systems, particularly transaction-based computer systems operating on a database, is providing some form of tolerance or resilience to failures that may occur during processing. Such tolerance is typically provided by checkpointing and redundancy. Checkpointing typically involves periodically saving the processing state of a machine and, after a failure is detected, restoring the computer to a previously saved, internally consistent processing state. Computer systems that provide checkpointing and redundancy typically use specially designed hardware and/or operating systems, or require an application programmer to create appropriate checkpoints.
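The save-and-restore cycle described above can be illustrated with a minimal sketch. It is not taken from any particular system: the class `CheckpointedProcessor`, the function `run`, and the `interval` parameter are hypothetical names chosen for illustration, and the "failure" is simulated by a malformed transaction.

```python
import copy


class CheckpointedProcessor:
    """Hypothetical toy transaction processor that tracks account balances."""

    def __init__(self):
        self.state = {"balances": {}, "applied": 0}
        # The initial (empty) state is the first consistent checkpoint.
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a deep copy so later updates cannot alter the saved state.
        self._saved = copy.deepcopy(self.state)

    def restore(self):
        # Roll back to the most recently saved consistent state.
        self.state = copy.deepcopy(self._saved)

    def apply(self, account, amount):
        if amount is None:
            # Simulated failure: a malformed transaction.
            raise ValueError("malformed transaction")
        balances = self.state["balances"]
        balances[account] = balances.get(account, 0) + amount
        self.state["applied"] += 1


def run(processor, transactions, interval=2):
    """Apply transactions, checkpointing after every `interval` of them.

    Returns True if all transactions were applied, or False if a failure
    was detected and the processor was restored to its last checkpoint.
    """
    for i, (account, amount) in enumerate(transactions, start=1):
        try:
            processor.apply(account, amount)
        except ValueError:
            processor.restore()
            return False
        if i % interval == 0:
            processor.checkpoint()
    return True


processor = CheckpointedProcessor()
completed = run(processor, [("a", 10), ("b", 5), ("a", 3), ("b", None)])
```

In this run the checkpoint is taken after the second transaction, so the failure on the fourth discards the third as well. Work performed since the last checkpoint is lost on restore, which is the usual trade-off between checkpoint frequency and recovery cost.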
Providing a checkpointing facility is more complex in dataflow and parallel computer systems, particularly dataflow systems that operate on parallel databases, such as the Orchestrate application environment from Torrent Systems, Inc., and similar products. Some of these problems are explored, in part, in "Loading Databases Using Dataflow Parallelism," SIGMOD Record, Volume 23, Number 4, pages 72-87, December 1994.