1. Field
The present disclosure relates generally to distributed systems, and more particularly, to systems and techniques for recovering from system failures in distributed systems.
2. Background
Computers and other modern processing systems have revolutionized the electronics industry by enabling complex tasks to be performed with just a few strokes of a keypad. These processing systems have evolved from simple self-contained computing devices, such as the calculator, to highly sophisticated distributed systems. Today, almost every aspect of our daily lives involves, in some way, distributed systems. In its simplest form, a distributed system may be thought of an individual computer capable of supporting two or more simultaneous processes, or a single process with multiple threads. On a larger scale, a distributed system may comprise a network with a mainframe that allows hundreds, or even thousands, of computers to share software applications. Distributed systems are also being used today to replace traditional supercomputers with any number of computers, servers, processors, or other components being connected together to perform specialized applications that require immense amounts of computations. The Internet is another example of a distributed system with a host of Internet servers providing the World Wide Web.
As we become more dependent upon distributed systems in our daily lives, it becomes increasingly important to guard against system failures. A system failure can be at the very least frustrating, but in other circumstances could lead to catastrophic results. For the individual computer, a system failure can result in the loss of work product and the inconvenience of having to reboot the computer. In larger systems, system failures can be devastating to the business operations of a company or the personal affairs of a consumer.
There are a number of system recovery techniques that are employed today to minimize the impact of system failures. One such technique is known as “rollback recovery.” The basic idea behind rollback recovery is to model the operation of a system as a series of states, and when an error occurs, to roll back the system to a previous error-free state and resume operation. One technique for implementing rollback recovery is commonly referred as Checkpoint-Based Rollback Recovery. Using this technique, the system saves in a stable database some of the states it reaches during operation as “checkpoints,” and when an error occurs, the system is restored to a previous error-free state from the checkpoints.
Log-Based Rollback Recovery is another technique that builds on the concept of Checkpoint-Based Rollback Recovery. In addition to checkpoints, this technique also uses information about non-deterministic events that occur between successive checkpoints. A non-deterministic event is generally an input to the system whose timing and content are unknown by system prior to receipt. However, for a given input and a given state in which the system receives this input, the execution of the system until the reception of the next input is deterministic. As a result, the execution of the system can be modeled as a sequence of deterministic state intervals, each initiated by a non-deterministic event. This follows the “piecewise deterministic” (PWD) assumption which postulates that all non-deterministic events that cause state transitions to the system can be recorded as determinants. When this assumption holds true, system recovery may be achieved by restoring the system to a previous prior error-free state based on the checkpoints, and then replaying the recorded determinants to restore the system to the state that existed just prior to the error.
Unfortunately, current Log-Based Rollback-Recovery techniques have no mechanism to deal with certain types of non-determinism inherent in systems capable of handling multiple processes, or a single process with multiple threads, that share a common state (i.e., address space). As an example, consider a distributed system on the Internet in which two computers conducting an e-commerce transaction with a server compete to purchase the same item. In this example, a scheduling entity within the server will determine which computer is granted access first and, hence, is able to consummate the transaction. However, should a system failure occur and the server be rolled back to a previous error-free state that existed prior to the transaction, there is no guarantee that the same computer will be granted access to the server before the other without extremely invasive modifications to the operating system and/or applications. This can be especially problematic when the system fails after the server confirms the original transaction.