Distributed computer systems have become a popular response to an ever-increasing demand for computing system resources. However, the increasing complexity of distributed computer systems has resulted in threats to their robustness and reliability, such as resource depletion, Heisenbugs (system bugs that change behavior during debugging), deadlocks, and other transient faults. Multiplying the number of servers or, more generally, server replicas (i.e., instances of a server executing simultaneously on multiple computers) provides helpful redundancy, but it does not solve every robustness and reliability problem. In particular, recovery from component underperformance or outright failure in conventional distributed computer systems may not be possible without excessive disruption of computer system resource users and/or may result in data loss.
Examples of conventional distributed computer systems include the “UNIX” Network File System (NFS) and its variants, the “GOOGLE” File System (GFS), the Calypso file system, the Echo file system, the Harp file system, the Frangipani file system, the Pangaea file system, the Ivy file system, and the Coda file system, as described in Kistler et al., “Disconnected Operation in the Coda File System,” Symposium on Operating Systems Principles (SOSP), October 1991 and, more generally, in James J. Kistler, “Disconnected Operation in a Distributed File System,” Technical Report CMU-CS-93-156, Carnegie Mellon University, May 1993. For the purposes of this description, distributed computer system components may be categorized as playing a server role (server-side components) or a client role (client-side components). In practical systems, distributed computer system components in a client role may be further categorized as operating at user level or at kernel level. This distinction is particularly relevant to failure recovery mechanisms because failure of kernel-level components is typically more disruptive than failure of user-level components. In addition, kernel-level components are typically required to comply with a different set of operational constraints than user-level components.
Some conventional distributed computer systems provide for lossless restartability of server-side components but not client-side components. Some client-side components may not be transparently restartable; for example, a kernel-level client component failure may require a computer reboot (e.g., computer operating system restart). Some conventional distributed computer systems fail to minimize the complexity of kernel-level client components. Some conventional distributed computer systems incorporate transparently restartable user-level client components but do not provide for lossless restart, which may result in the loss of, for example, any computer system resource updates that occurred in the 30 seconds before component failure.
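To make the distinction between transparent restart and lossless restart concrete, the following is a minimal sketch, not a description of any of the systems named above. It assumes a hypothetical user-level client (`JournaledClient`, an illustrative name) that appends each update to a durable journal before acknowledging it, so that a restarted instance can rebuild its state by replaying the journal. A client that instead buffered updates and flushed them periodically would lose whatever arrived after the last flush.

```python
import json
import os
import tempfile


class JournaledClient:
    """Hypothetical user-level client that journals every update.

    Illustrative sketch only: each update is forced to stable storage
    before it is applied, so a restart can replay the journal.
    """

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.state = {}
        self._replay()  # rebuild state left behind by a previous instance
        self._journal = open(journal_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, "r", encoding="utf-8") as journal:
            for line in journal:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record from a crash: stop replaying
                self.state[record["key"]] = record["value"]

    def update(self, key, value):
        record = json.dumps({"key": key, "value": value})
        self._journal.write(record + "\n")
        self._journal.flush()
        os.fsync(self._journal.fileno())  # durable before acknowledging
        self.state[key] = value

    def close(self):
        self._journal.close()


def demo_restart():
    """Simulate a failure and restart; return the state seen after restart."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "client.journal")
        first = JournaledClient(path)
        first.update("a.txt", "v1")
        first.update("a.txt", "v2")
        first.update("b.txt", "v1")
        # Releasing the handle stands in for the failure; every update
        # was already synced, so closing adds no extra durability.
        first.close()
        restarted = JournaledClient(path)
        recovered = dict(restarted.state)
        restarted.close()
        return recovered
```

Calling `demo_restart()` returns the same key-value state that the first instance held when it stopped. The cost of this design choice is one synchronous write per update, which is the trade-off a periodic-flush design avoids at the price of a loss window.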
Some conventional distributed computer systems provide for transparent restartability of server-side components but lack broad-spectrum fault tolerance that includes, for example, Byzantine fault tolerance as well as fail-stop fault tolerance, such as may be supported by replicated state machine (RSM) architectures. Furthermore, some conventional distributed computer systems fail to provide an effective solution to the problem of underperforming server-side components. In particular, some conventional distributed computer systems that utilize state-based updates (e.g., some systems incorporating server replicas) fail to enable efficient incremental state changes without resorting to, for example, low-level page-based solutions or idiosyncratic solutions applicable only to narrow cases.