1. Field of the Invention
This invention is related to the field of highly available computer systems and, more particularly, to the failing over of applications in computer systems, including clustered computer systems.
2. Description of the Related Art
Certain applications are often required to be available virtually without interruption, either 24 hours a day or at least during working hours. Various efforts have been undertaken to provide services that support the high availability of such applications. Such highly available applications may include email servers, web servers, database servers, etc.
Typically, efforts to provide high availability for a given application have focused on detecting that the application has failed and getting the application re-started. An application may fail due to an internal coding error in the application, an error in the operating system on which the application is running, an error in the hardware of the computer system on which the application is running, or a combination of any of the above errors. The errors may cause the application, or the operating system, to cease executing (e.g. a crash) or to stop functioning (e.g. a hang).
In some cases, each application for which high availability is desired may be assigned to a separate computer system. In this configuration, a failure of one application may not affect the operation of the other applications on the other computer systems. Additionally, this configuration allows for variations in the operating system on which the applications are run (e.g. different versions of the same operating system, or different operating systems). However, the cost of obtaining and maintaining a separate computer system for each application may be considerable.
Another method is to cluster a group of computer systems using specialized software (referred to as a cluster server) to control the group of computer systems. A given application may be executed on a first computer system of the group. The cluster server monitors the operation of the application and, if the cluster server detects that the application has failed, the cluster server may close the application on the first computer system and restart the application on another computer system. Typically, such cluster servers involve identifying, for each application supported by the cluster server, all of the state in the computer system that is needed to restart the application. In practice, such identification may be problematic and frequently involves making use of undocumented features of the application. Additionally, some applications may not function correctly when restarted on another machine. For example, the Exchange 2000 application from Microsoft Corporation may be unable to access a mailbox database that was used when the application was executing on another machine, because Microsoft's Active Directory may identify that other machine as the owner of the database.
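The monitor-and-restart behavior of a cluster server described above can be illustrated with a minimal sketch. It does not represent any particular product: the names `Node`, `ClusterServer`, `probe`, `launch`, and `check` are hypothetical and introduced only for illustration, and the sketch deliberately omits the difficult part noted above, namely identifying and transferring the state needed to restart the application.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Node:
    """Hypothetical stand-in for one computer system in the cluster."""
    name: str
    healthy: bool = True
    running: Set[str] = field(default_factory=set)

    def start(self, app: str) -> None:
        self.running.add(app)

    def stop(self, app: str) -> None:
        self.running.discard(app)


class ClusterServer:
    """Illustrative cluster server: monitors applications and fails them over."""

    def __init__(self, nodes: List[Node], probe: Callable[[Node, str], bool]):
        self.nodes = nodes
        self.probe = probe  # assumed health check: True if app is working on node
        self.placement: Dict[str, Node] = {}  # application name -> current node

    def launch(self, app: str, node: Node) -> None:
        node.start(app)
        self.placement[app] = node

    def check(self, app: str) -> Node:
        """Probe the application once; on failure, restart it on another node."""
        current = self.placement[app]
        if self.probe(current, app):
            return current
        # Failure detected: close the application on the first computer system.
        current.stop(app)
        # Restart it on another computer system that appears healthy.
        for candidate in self.nodes:
            if candidate is not current and candidate.healthy:
                candidate.start(app)
                self.placement[app] = candidate
                return candidate
        raise RuntimeError(f"no healthy node available for {app}")


def simple_probe(node: Node, app: str) -> bool:
    # Assumed probe for this sketch: node is up and the app is running on it.
    return node.healthy and app in node.running
```

In a real deployment, `check` would be invoked periodically for each supported application, and the restart step would first have to make the application's state available on the new node, which is where the difficulties described above arise.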