1. Field
The present disclosure relates generally to distributed processing systems, and more particularly to systems and techniques for recovering from system failures.
2. Background
Computers and other modern processing systems have revolutionized the electronics industry by enabling complex tasks to be performed with just a few strokes of a keypad. These processing systems have evolved from simple self-contained computing devices, such as the calculator, to highly sophisticated distributed processing systems. Today, almost every aspect of our daily lives involves, in some way, distributed processing systems. In its simplest form, a distributed processing system may be thought of as an individual desktop computer capable of supporting two or more simultaneous processes, or a single process with multiple threads. On a larger scale, a distributed processing system may comprise a network with a mainframe that allows hundreds, or even thousands, of individual desktop computers to share software applications. Distributed processing systems are also being used today to replace traditional supercomputers, with any number of computers, servers, processors or other components being connected together to perform specialized applications that require immense amounts of computation. The Internet is another example of a distributed processing system with a host of Internet servers providing the World Wide Web.
As we become more dependent upon distributed processing systems in our daily lives, it becomes increasingly important to guard against system failures. A system failure can be at the very least annoying, but in other circumstances could lead to catastrophic results. For the individual desktop computer, a system failure can result in the loss of work product and the inconvenience of having to reboot the computer. In larger systems, system failures can be devastating to the business operations of a company or the personal affairs of a consumer.
A number of system recovery techniques are employed today to minimize the impact of system failures. One such technique involves “checkpointing” and “rollback recovery.” During normal operation, each of a computer's processes saves a snapshot of its states, called a “checkpoint,” to stable storage. When a failure occurs, a rollback recovery mechanism retrieves a set of saved checkpoints. The failed process can then roll back to the corresponding retrieved checkpoint and resume execution from there. Although this type of automatic recovery is much faster than waiting for a process failure to be manually resolved, computation speed is nevertheless hindered since computation is blocked while checkpoints are saved to stable storage and during process rollbacks.
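By way of illustration only, the checkpointing and rollback recovery described above might be sketched as follows. The class name `CheckpointedProcess`, its methods, and the in-memory list standing in for stable storage are hypothetical and are not taken from any particular system; the sketch assumes a single process whose state is a small dictionary.

```python
import copy


class CheckpointedProcess:
    """Hypothetical single process that checkpoints its state and can roll back."""

    def __init__(self):
        self.state = {"counter": 0}
        # Stand-in for stable storage; a real system would write to disk
        # or another medium that survives the failure.
        self._stable_storage = []

    def step(self):
        # One unit of computation.
        self.state["counter"] += 1

    def checkpoint(self):
        # Computation is blocked while the snapshot is copied out.
        self._stable_storage.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the most recently saved checkpoint.
        if not self._stable_storage:
            raise RuntimeError("no checkpoint available")
        self.state = copy.deepcopy(self._stable_storage[-1])


def demo_rollback():
    p = CheckpointedProcess()
    p.step()
    p.step()
    p.checkpoint()  # snapshot taken with counter == 2
    p.step()        # work done after the checkpoint is lost on failure
    p.rollback()    # simulated recovery
    return p.state["counter"]
```

In this sketch, work performed after the last checkpoint is lost on rollback, and `checkpoint()` runs synchronously, which mirrors the blocking cost noted above.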
Computation speed can be even more negatively impacted in distributed systems. In a distributed system, processes communicate by message passing, such that individual process states may become dependent on one another. Determining the state of a distributed system, then, can be complicated. Rollback recovery in a distributed system requires checkpointing a consistent global state, which is a set of process states in which the processes agree on whether or not message exchange among processes has occurred. Thus, each process must make its checkpoints in coordination with other processes. This, of course, increases the amount of time required to store checkpoints. Since access speed to most types of stable storage is orders of magnitude slower than computation, checkpointing consistent global states in a distributed system can have a significant negative effect on computation speed.
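As a rough illustration, one common way to formalize a consistent global state is to require that no checkpoint records the receipt of a message that no checkpoint records as sent. The function `is_consistent` below and its data layout (a dictionary of per-process `sent` and `received` message identifiers) are assumptions made for this sketch only.

```python
def is_consistent(checkpoints):
    """Return True if no checkpoint records receiving an unsent message.

    checkpoints maps a process name to a dict with two sets of message
    identifiers: "sent" and "received" (a hypothetical layout).
    """
    all_sent = set().union(*(c["sent"] for c in checkpoints.values()))
    return all(c["received"] <= all_sent for c in checkpoints.values())


# P checkpoints after sending m1; Q checkpoints after receiving it.
consistent = {
    "P": {"sent": {"m1"}, "received": set()},
    "Q": {"sent": set(), "received": {"m1"}},
}

# P checkpoints before sending m1, yet Q's checkpoint already shows m1
# as received, so the two processes disagree about the exchange.
inconsistent = {
    "P": {"sent": set(), "received": set()},
    "Q": {"sent": set(), "received": {"m1"}},
}
```

Rolling back to the second set of checkpoints would leave Q in a state that depends on a message P has never sent, which is why the processes must coordinate when taking checkpoints.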
One technique for mitigating the slowdown is known as “concurrent checkpointing.” This technique involves using memory protection so that computation is allowed to proceed while a checkpoint is stored to stable storage. Concurrently storing checkpoints in the background can prevent computation from being blocked for long periods in certain types of distributed computing systems. This technique can be useful in systems that take checkpoints infrequently, for example on the order of every 15 minutes or more. Unfortunately, many systems require taking small checkpoints very frequently, such as on the order of hundreds of milliseconds. For example, enterprise systems involving financial transactions and other rapidly changing data sets must be frequently checkpointed since the transactions themselves cannot be repeated in order to re-start a system during a rollback. In this type of system, concurrent checkpointing has not been as beneficial because a second checkpoint is often requested while a first checkpoint is still being stored. When this happens, the process is blocked, and computation cannot proceed, until the first checkpoint has finished storing. In the case of rapid checkpointing, a process can be repeatedly blocked this way, such that performance is significantly diminished. Thus, while concurrent checkpointing has improved computation time in some distributed systems, it has not resolved the problem of computation blocking in many important types of distributed systems.
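The blocking behavior described above can be illustrated with a simplified sketch. The class `ConcurrentCheckpointer`, the `write_delay` parameter, and the use of a background thread with a deep copy in place of memory protection are all assumptions made for illustration; an actual concurrent checkpointing system would typically rely on page-level memory protection rather than copying the state up front.

```python
import copy
import threading
import time


class ConcurrentCheckpointer:
    """Hypothetical checkpointer that stores snapshots in the background."""

    def __init__(self, write_delay=0.05):
        self._write_delay = write_delay  # simulated stable-storage latency
        self._writer = None
        self.stored = []
        self.blocked_requests = 0

    def _write(self, snapshot):
        time.sleep(self._write_delay)  # slow write to stable storage
        self.stored.append(snapshot)

    def request_checkpoint(self, state):
        # A new request must wait if the previous checkpoint is still
        # being stored; computation is blocked for that time.
        if self._writer is not None and self._writer.is_alive():
            self.blocked_requests += 1
            self._writer.join()
        snapshot = copy.deepcopy(state)  # stand-in for memory protection
        self._writer = threading.Thread(target=self._write, args=(snapshot,))
        self._writer.start()

    def wait(self):
        if self._writer is not None:
            self._writer.join()


def demo_rapid_checkpoints():
    cp = ConcurrentCheckpointer(write_delay=0.05)
    state = {"n": 0}
    for i in range(3):
        state["n"] = i
        cp.request_checkpoint(state)  # requests arrive faster than writes finish
    cp.wait()
    return cp.blocked_requests, [s["n"] for s in cp.stored]
```

In this sketch the first request returns immediately, while each later request arrives before the previous write has completed and therefore stalls the caller, which is the repeated blocking that limits concurrent checkpointing under rapid checkpoint rates.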