The reliability of an application program executing on a computing device depends on the reliability of the underlying operating system, in addition to the robustness of the application program itself. When critical application programs are hosted on highly utilized server machines, the probability of server crashes may increase due to the increased load associated with the application programs. Clients executing the application programs may then experience long outage times and/or incorrect execution.
In a traditional computing model, the operating system, device drivers and the application program(s) occupy tiers of a hierarchy in which the operating system and the device drivers are at a “low level,” in contrast to the application program(s) at a “high level.” Application programs may communicate with the operating system through system calls, and the operating system keeps the application programs apprised of events through signals. Although application program(s) execute in disjoint address spaces and, therefore, are insulated from failures of other application programs, failure(s) associated with the operating system may affect the entire computing system.
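The two directions of communication described above may be illustrated with a minimal sketch. It assumes a POSIX system and uses Python's standard `os` and `signal` modules as thin wrappers over system calls. The function name `demo_syscalls_and_signals` and the choice of `SIGUSR1` are hypothetical and serve only as illustration.

```python
import os
import signal


def demo_syscalls_and_signals():
    """Illustrative only: system calls go down, a signal comes back up."""
    received = []  # signal numbers recorded by the handler

    def on_signal(signum, frame):
        # Runs in the application once the operating system has
        # delivered the signal to this process.
        received.append(signum)

    # Ask the operating system to notify this process of SIGUSR1 events.
    previous = signal.signal(signal.SIGUSR1, on_signal)

    # Application -> operating system: each call below wraps a system call.
    read_fd, write_fd = os.pipe()      # pipe(2)
    os.write(write_fd, b"hello")       # write(2)
    data = os.read(read_fd, 5)         # read(2)
    os.close(read_fd)                  # close(2)
    os.close(write_fd)

    # Operating system -> application: request delivery of SIGUSR1
    # to this same process via kill(2).
    os.kill(os.getpid(), signal.SIGUSR1)

    # Restore whatever handler was installed before the demo.
    signal.signal(signal.SIGUSR1, previous)
    return data, received
```

The handler and the pipe both live inside the application's own address space; only the operating system mediates between them, which is why a failure at that lower tier may affect every application program at once.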
Reliability solutions such as checkpointing may enable the restart of application program(s) in the event of an operating system crash. Checkpointing may involve writing the state information of a computing system to persistent storage from time to time. Following an operating system crash and a subsequent reboot of the computing system, an application program executing thereon may be restored to a state saved prior to the crash. However, any change in state information since the most recent checkpoint may be lost. Other solutions such as high availability clusters try to reduce downtime by maintaining partner node(s) that take over services associated with a primary node following a crash. The aforementioned solutions have limitations, such as performance overhead in the case of checkpointing and homogeneous hardware configuration requirement(s) in the case of clustering.
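The checkpointing behavior described above, including the loss of state changed since the last checkpoint, may be sketched as follows. This is an application-level illustration under stated assumptions, not a system-wide checkpointing implementation: the state is a small dictionary serialized as JSON, and the names `save_checkpoint`, `load_checkpoint`, `run` and `checkpoint_every` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to persistent storage without leaving a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())     # push the data to the storage device
    os.replace(tmp_path, path)     # atomically replace the old checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, steps, checkpoint_every):
    """Resume from the last checkpoint and perform `steps` units of work."""
    state = load_checkpoint(path, {"counter": 0})
    for _ in range(steps):
        state["counter"] += 1      # one unit of work
        if state["counter"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

For example, calling `run(path, 7, 3)` on a fresh path leaves the in-memory counter at 7 while the checkpoint on disk holds 6. If a crash occurred at that point, a restart would resume from 6 and the seventh unit of work would be lost. Checkpointing more often narrows that window but adds a synchronous write on each checkpoint, which is the performance limitation noted above.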