1. Field of the Invention
This invention pertains generally to enterprise computer systems, computer networks, embedded computer systems, and computer systems, and more particularly to methods, systems and procedures (i.e., programming) for providing high availability services and automatic fault detection and recovery for computer applications distributed across multiple computers.
2. Description of Related Art
Enterprise systems operating today are subject to continuous program execution, that is, 24 hours a day, 7 days a week. There is no longer the concept of “overnight” or “planned downtime”. All programs and data must be available at any point during the day and night. Any outage or deteriorated service can result in loss of revenue as customers simply take their business elsewhere, and the enterprise stops functioning on a global scale. Traditionally, extremely high degrees of availability have been achieved with customized applications running on custom hardware, all of which is expensive and proprietary. Furthermore, application services being utilized today are no longer run as single processes on a single server, but are instead built from a collection of individual programs running on different servers. Traditionally, no mechanisms have existed for protecting these fully distributed applications. This problem is compounded by the fact that the individual applications comprising the service are typically provided by different vendors.
Two publications provide a background for understanding aspects of the current invention. A first publication is U.S. patent application Ser. No. 11/213,678 filed on Aug. 26, 2005, and published as US 2006-0090097 A1 on Apr. 27, 2006, incorporated herein by reference in its entirety, which describes providing transparent and automatic high availability for applications where all the application processes are executed on one node. A second publication is U.S. patent application Ser. No. 11/213,630 filed on Aug. 26, 2005, and published as US 2006-0085679 A1 on Apr. 20, 2006, incorporated herein by reference in its entirety, which describes technology to support stateful recovery of multi-process applications wherein the processes are running on the same node. However, the above-referenced publications do not fully address distributed applications where an application runs across multiple nodes at the same time, and where fault detection and recovery need to involve multiple independent nodes.
Therefore, a need exists for a method and system for achieving high availability and reliability for distributed applications in a manner that is automatic and transparent to the client, and that does not require custom coding, custom applications, or specialized hardware.