1. Field of the Invention
This invention pertains to software-based fault-tolerant computer systems, computer networks, telecommunications systems, embedded computer systems, and wireless devices such as cell phones and PDAs, and more particularly to methods, systems and procedures (i.e., programming) for consistent replication of application programs across two or more servers.
2. Description of Related Art
In many environments one of the most important requirements is that a running application continues to run even in the event of one or more system or software faults. Mission-critical systems in telecommunications, military, financial and embedded applications must continue to provide their service even in the event of hardware or software faults. The auto-pilot on an airplane is designed to continue to operate even if some of the computers and instrumentation are damaged; the 911 emergency phone system is designed to operate even if the main phone system is severely damaged; and stock exchanges deploy software that keeps the exchange running even if some of the routers and servers go down. Today, the same expectations of “fault-free” operation are being placed on commodity computer systems and standard applications.
Fault-tolerant systems are based on the use of redundancy (replication) to mask faults. For hardware fault tolerance, servers, networking or subsystems are replicated. For application fault tolerance, the applications are replicated. Faults on the primary system or application are masked by having the backup system or application (the replica) take over and continue to provide the service. The take-over after a fault at the primary system is delicate and often highly system- or application-specific.
Several approaches have been developed to address the fundamental problem of providing fault tolerance. Tandem Computers (http://en.wikipedia.org/wiki/Tandem_computer) is an example of a computer system with custom hardware, a custom operating system and custom applications, offering transaction-level fault tolerance. In this closed environment, with custom applications, operating system and hardware, a fault on the primary system can be masked down to the transaction boundary, and the backup system and application take over seamlessly. Fault detection and failover are performed in real time.
In many telecommunications systems fault tolerance is built in. Redundant line cards are provided within the switch chassis, and if one line card goes down, the switching fabric automatically re-routes traffic and live connections to a backup line card. As with the Tandem systems, many telecommunications systems are essentially closed systems with custom hardware, custom operating systems and custom applications. Fault detection and failover are performed in real time.
In enterprise software systems the general approach taken is the combined use of databases and high availability. By custom programming the applications with hooks for high availability it is generally possible to detect and recover from many, but not all, types of faults. In enterprise systems, it is typically considered “good enough” to recover the application's transactional state, and there are often no hard requirements that the recovery be performed in real time. In general, rebuilding the transactional state for an application server can take 30 minutes or longer. During this time, the application service, an e-commerce website for instance, is unavailable and cannot serve customers. The very slow fault recovery can to some extent be alleviated by extensive use of clustering and highly customized applications, as evidenced by Amazon.com and ebay.com, but that is generally not a viable choice for most deployments.
In U.S. Pat. No. 7,228,452 Moser et al. teach “transparent consistent semi-active and passive replication of multithreaded application programs”. Moser et al. disclose a technique to replicate running applications across two or more servers. The teachings are limited to single-process applications and only address replica consistency as it relates to mutex operations and multi-threading. Moser's invention does not require any modification to the applications and works on commodity operating systems and hardware. Moser is incorporated herein in its entirety by reference.
Therefore, a need exists for systems and methods for providing transparent application replication that address all types of applications, including multi-process, multi-threaded applications, applications that use any type of locking mechanism, and applications that access any type of external resource. Furthermore, the application replication must be consistent and work on commodity operating systems, such as Windows and Linux, and on commodity hardware with standard applications.