It is generally accepted that device drivers cause the majority of failures in commodity operating systems (OSs). Earlier work has shown that an OS kernel reliability system can prevent driver errors from crashing the Linux™ OS kernel, thus maintaining OS integrity. Thus far, however, reliability subsystems have been unable to maintain the integrity of applications using a failed device driver. The failure and recovery of a device driver typically terminates all applications using that device driver, since applications are rarely written to handle device driver errors.
The importance of recovery has long been known in the database community, where transactions prevent data corruption and allow applications to manage failure. More recently, failure recovery has become an important issue for OSs and applications.
The most general approach to recovery is to run application replicas on two machines, a primary and a backup. All inputs to the primary are mirrored to the backup. After a failure of the primary, the backup machine takes over to provide service. The replication can be performed within the hardware, at the hardware-software interface, at the system call interface, or at a message passing or application interface. However, this approach adds considerable cost and complexity.
Another common recovery approach is to restart applications after a failure. Many systems periodically save the application state as checkpoints, while others combine checkpoints with logs. These systems transparently restart failed applications from their last checkpoint (possibly on another machine) and replay the log if one is present. However, recent work has shown that this approach is limited when recovering from application faults, since applications often become corrupted before they fail, and thus, their logs or checkpoints may also be corrupted. Yet another approach is simply to reboot the failed component.
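The checkpoint-and-log restart scheme described above can be illustrated with a minimal sketch. It is not drawn from any of the systems mentioned; the class `CheckpointedApp` and its methods are hypothetical names chosen for illustration, and the "application state" is assumed to be a simple dictionary of counters.

```python
import copy


class CheckpointedApp:
    """Toy application (hypothetical) whose state is a dict of counters."""

    def __init__(self):
        self.state = {}        # volatile state, lost on a crash
        self.checkpoint = {}   # last saved snapshot of the state
        self.log = []          # operations applied since the snapshot

    def apply(self, key, delta):
        # Record the operation in the log, then mutate the state.
        self.log.append((key, delta))
        self.state[key] = self.state.get(key, 0) + delta

    def take_checkpoint(self):
        # Save a snapshot; operations before it no longer need replaying.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def crash(self):
        # Simulate a failure that destroys the volatile state.
        self.state = None

    def recover(self):
        # Restart from the last checkpoint and replay the log.
        self.state = copy.deepcopy(self.checkpoint)
        for key, delta in self.log:
            self.state[key] = self.state.get(key, 0) + delta


app = CheckpointedApp()
app.apply("rx", 3)
app.take_checkpoint()
app.apply("rx", 2)
app.apply("tx", 1)
app.crash()
app.recover()
```

The sketch also makes the stated limitation visible: `recover` restores whatever was saved, so if the state was already corrupted when `take_checkpoint` or `apply` ran, the corruption is faithfully restored along with everything else.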
Nooks™, a system previously developed to handle device driver and extension faults, takes this latter approach: it unloads and reloads failed OS kernel extensions, such as device drivers. Rebooting has been proposed as a general strategy for building high-availability software, but it forces applications to handle the failure, for example, by taking over the task of re-initializing state that has been lost by the rebooted component. However, few existing applications are able to reboot without losing state. Accordingly, this approach is not practical for improving the reliability of existing applications. Clearly, a solution is needed that addresses the problem of device driver failures by restoring the device driver state lost in the reboot, transparently to applications, so that the failure of the device driver minimally impacts the OS and any applications using the device driver.
The solution should also facilitate device driver isolation in order to prevent failed device drivers from corrupting the OS or applications. Such isolation can be provided in various ways. It has been proposed to encapsulate extensions using software fault isolation, and to use transactions to repair OS kernel state after a fault. Nooks™ and other approaches isolate extensions in protection domains enforced by virtual memory hardware. Microkernels and their derivatives enforce isolation by executing extensions in user mode. Rather than concealing failures, though, all of these systems take a revealing strategy, in which the application or user is made aware of the failure. The OS typically returns an error code, telling the application that a system call failed, but little else (e.g., it does not indicate which component failed or how the failure occurred). The burden of recovery then rests on the application, which must decide what steps to take to continue executing. Most applications are not prepared to handle the failure of device drivers, since device driver faults typically cause a system crash on commodity OSs.
Mechanisms have been proposed that transparently improve the reliability of existing software through interposition. Other systems approach the same goal by verifying the correctness of system calls, restarting applications after a failure, retrying failed system and library calls, restarting OS kernel extensions after a failure, or reconnecting applications to databases after a failure. Accordingly, it would be desirable to use procedure call interposition to mirror and redirect OS kernel-device driver communications.
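As a rough illustration of procedure call interposition, the following sketch places a wrapper between a caller and a driver, mirrors configuration calls as they pass through, and redirects later calls to a fresh driver instance after a fault. It is a toy model under stated assumptions, not the mechanism of any system named above: `ToyDriver`, `Interposer`, and all of their methods are hypothetical, and a real OS kernel-device driver interface would be far richer.

```python
class ToyDriver:
    """Hypothetical driver: holds configuration and can be made to fail."""

    def __init__(self):
        self.config = {}
        self.healthy = True

    def configure(self, key, value):
        self.config[key] = value

    def send(self, packet):
        if not self.healthy:
            raise RuntimeError("driver fault")
        return len(packet)  # number of bytes "sent"


class Interposer:
    """Sits between the caller and the driver (all names are assumptions)."""

    def __init__(self, driver_factory):
        self.driver_factory = driver_factory
        self.driver = driver_factory()
        self.mirrored = []   # copy of every configuration call seen so far
        self.restarts = 0

    def configure(self, key, value):
        # Mirror the call, then pass it through to the driver.
        self.mirrored.append((key, value))
        self.driver.configure(key, value)

    def send(self, packet):
        try:
            return self.driver.send(packet)
        except RuntimeError:
            # Conceal the fault: restart the driver and retry once.
            self.restart()
            return self.driver.send(packet)

    def restart(self):
        # Redirect to a fresh instance and replay the mirrored calls.
        self.driver = self.driver_factory()
        for key, value in self.mirrored:
            self.driver.configure(key, value)
        self.restarts += 1


ip = Interposer(ToyDriver)
ip.configure("mtu", 1500)
first = ip.send("a")
ip.driver.healthy = False   # simulate a driver fault
second = ip.send("bc")      # caller sees a normal return value
```

In this sketch the caller never sees the `RuntimeError`; the wrapper absorbs it, rebuilds the driver's configuration from the mirrored calls, and completes the request on the new instance.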
Several systems have narrowed the scope of recovery to focus on a specific subsystem or component. For example, the Rio file cache achieves high performance by isolating a single system component, the file cache, from OS kernel failures. Another technique provides transparent recovery after the failure of a single component type, namely replicated databases in multi-tier applications. Thus, it appears that a solution to the problem of system stability should focus on recovery for a single OS component type, the device driver, which is the leading cause of OS failure. By abandoning general-purpose recovery, a major cause of application and OS failure can be resolved, while simplifying implementation and reducing runtime overhead. In a more general sense, the solution that is developed for handling device driver failures should be applicable to handling failures of other types of software modules, so that the OS and applications using those modules are minimally (or not at all) affected by the failure of a module.