With the advent of enterprise servers came increased demand for the applications residing on those servers to be mission-critical. Growing dependence on such servers and applications has left little tolerance for downtime in computing environments. High-availability computing, once considered a strategic advantage, has now become a tactical necessity. A requirement of 99.999% availability means that mission-critical applications cannot be down for more than about five minutes over the span of an entire year.
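The downtime budget implied by an availability target can be checked with simple arithmetic. The following is a minimal sketch, assuming a 365-day year; the function name is illustrative only.

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes = 525,600.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% availability -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

At 99.999% the budget works out to roughly 5.26 minutes per year, which is why even a single slow failover can consume the entire allowance.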
One example of a mission-critical application is the customer and product database server of an online retail store. Because the store's customers can place orders at any hour of the day, and the site serves multiple time zones, the database must be available 24 hours a day, 7 days a week. Any downtime has a direct effect on the store's sales and profits.
An enterprise system consisting of only one server can take a long time to recover from a failure. While the server is down, all programs and resources on that server are unavailable to users. Because the availability of the system depends entirely on whether that one server is running, this type of configuration is not acceptable for mission-critical applications.
By clustering two server nodes to appear as one logical server, Microsoft Cluster Service (MSCS) provides a simple availability solution for Microsoft Windows server systems. When one server fails in a clustered system, the applications that run as cluster resources and are owned by the failed server are taken over by the surviving node, thus minimizing application downtime. This process is referred to as failover.
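The failover behavior described above can be illustrated with a small model. This is only a sketch of the concept under simplifying assumptions: the `Node` class, the `fail_over` function, and the node and resource names are hypothetical and do not correspond to any MSCS interface.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node that owns a set of named resources."""
    name: str
    alive: bool = True
    resources: set = field(default_factory=set)


def fail_over(failed: Node, survivor: Node) -> set:
    """Move every resource owned by the failed node to the surviving node."""
    if not survivor.alive:
        raise RuntimeError("no surviving node available to take over resources")
    failed.alive = False
    moved = set(failed.resources)
    survivor.resources |= moved
    failed.resources.clear()
    return moved


node_a = Node("NODE-A", resources={"sql-server", "file-share"})
node_b = Node("NODE-B", resources={"print-spooler"})

moved = fail_over(node_a, node_b)
print("moved:", sorted(moved))
print("NODE-B now owns:", sorted(node_b.resources))
```

In a real cluster the transfer is not instantaneous, since each resource must be brought online on the survivor; that delay is the failover time discussed next.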
The time needed for the resources of a failed server to transfer to and resume on the surviving server is determined by a variety of factors, and ranges from a few seconds to 30 minutes or even longer in some situations. This is still unacceptable for mission-critical applications. Thus the need arises to expedite failover when a failure occurs or is imminent in an MSCS cluster.
In addition, the cluster (MSCS) itself is prone to failure for various hardware and software reasons. Each time the cluster goes down, it brings down the applications that are configured to run in the cluster environment. The fixes required in many of these situations are reasonably complex, are time-consuming if performed manually, and require the attention of an experienced cluster administrator. Thus the need arises for monitoring an MSCS cluster for such critical cluster failures, executing automated fixes whenever possible, and notifying operations personnel of failures together with possible solutions, all to avoid long error detection and resolution processes.
One prior art method to which the method of the present invention generally relates is described in U.S. Pat. No. 6,088,727 entitled CLUSTER CONTROLLING SYSTEM OPERATING ON A PLURALITY OF COMPUTERS IN A CLUSTER SYSTEM. This prior art method is a cluster controlling system that monitors and controls the packages in the entire system and, when a fault or failure has occurred, transfers packages that have been operating on one computer to another computer. When the respective packages are started up, cluster daemons on the respective computers monitor and control resources on the operating computers. The monitored and controlled data are stored in the respective computers as local data. A manager communicates with the cluster daemons on the respective computers, and stores data in a global data memory to monitor and control the entire system. The manager is itself one of the packages operating in the cluster system. If a fault or failure occurs in the manager or in the computer running the manager, the manager is restarted on another computer by a cluster daemon. A “daemon” is a program that performs a utility (housekeeping or maintenance) function without being requested or even known by the user. A daemon sits in the background and is called into play only when needed, for example, to help correct an error from which another program cannot recover.
The present invention differs from this prior art in that the prior invention focuses on restoring a failed application (package) on a different cluster node after the occurrence of a failure. It is not clear whether the prior invention addresses cluster node failures and cluster software failures, or only package failures. The present invention is aimed at monitoring for “impending” system or cluster failures and, in the case of system failures, initiating proactive failover “before” the occurrence of an actual failure. A proactive failover takes less time than a failover initiated by application failure. The reduced failover time results in decreased application downtime.
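The idea of acting on an impending failure, rather than on an actual one, can be sketched as a threshold check. This is an illustrative sketch only: the indicator names and threshold values below are assumptions made for the example and are not taken from this description, and `initiate_failover` stands in for whatever mechanism actually moves the resources.

```python
from typing import Callable, Dict

# Hypothetical warning thresholds (assumed for illustration only).
THRESHOLDS: Dict[str, float] = {
    "memory_used_pct": 95.0,
    "disk_errors_per_min": 5.0,
}


def impending_failure(metrics: Dict[str, float]) -> bool:
    """Return True if any monitored indicator has reached its warning threshold."""
    return any(metrics.get(name, 0.0) >= limit for name, limit in THRESHOLDS.items())


def monitor_once(metrics: Dict[str, float], initiate_failover: Callable[[], None]) -> bool:
    """Check one set of readings and trigger failover before an actual failure."""
    if impending_failure(metrics):
        initiate_failover()
        return True
    return False


events = []
monitor_once({"memory_used_pct": 40.0, "disk_errors_per_min": 0.0}, lambda: events.append("failover"))
monitor_once({"memory_used_pct": 97.5, "disk_errors_per_min": 0.0}, lambda: events.append("failover"))
print(events)
```

Only the second reading crosses a threshold, so failover is requested once, while the node is still running and able to hand over its resources in an orderly way.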
Another prior art method to which the method of the present invention generally relates is detailed in U.S. Pat. No. 5,287,453 entitled FAST REMOTE FILE ACCESS FACILITY FOR DISTRIBUTING FILE ACCESS REQUESTS IN A CLOSELY COUPLED COMPUTER SYSTEM. The method of the prior invention includes a plurality of independently operated computer systems located in close proximity to each other. Each system includes a system bus, a memory, and a set of local peripheral devices which connect in common to the system bus. The computer systems are interconnected for transferring messages to each other through the channels of a high speed cluster controller which connect to the system buses. Each system further includes a cluster driver which transfers the messages between the memory of the computer system and the corresponding cluster controller channel when the system is configured to operate in a cluster mode of operation. User application programs issue monitor calls to access files contained on one or more peripheral devices. The fast remote file access (FRFA) facility included in each system, upon detecting that the peripheral device is not locally attached, packages the monitor call and information identifying the user application into a message. The message is transferred through the cluster driver and cluster controller to the FRFA of the computer system to which the peripheral device attaches. The monitor call is executed and the response is sent back through the cluster controller and delivered to the user application in such a manner that the peripheral device of the other computer system appears to be locally attached and the monitor call appears to be locally executed.
The present invention differs from that prior art in that the prior invention deals with making remote cluster peripheral devices appear as local. The present invention, on the other hand, deals with application availability and downtime issues.
Yet another prior art method to which the method of the present invention generally relates is detailed in U.S. Pat. No. 6,078,957 entitled METHOD AND APPARATUS FOR A TCP/IP LOAD BALANCING AND FAILOVER PROCESS IN AN INTERNET PROTOCOL (IP) NETWORK CLUSTERING SYSTEM. The prior art method is a method and apparatus for monitoring packet loss activity in an Internet Protocol (IP) network clustering system which can provide a useful, discrete and tangible mechanism for controlled failover of the TCP/IP network cluster system. An adaptive interval value is determined as a function of the average packet loss in the system. This adaptive interval value is used to determine when a cluster member must send its next keepalive message to all other cluster members, and the keepalive messages are in turn used to determine network packet loss.
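An adaptive keepalive interval of this general kind can be sketched as follows. The summary above states only that the interval is a function of average packet loss; the particular choices here (an exponential moving average, and an interval that shrinks linearly as loss rises) are assumptions made for illustration and are not taken from the cited patent.

```python
def update_average_loss(previous_avg: float, sample: float, weight: float = 0.2) -> float:
    """Blend a new loss sample (0.0 to 1.0) into a running average (assumed EMA)."""
    return (1.0 - weight) * previous_avg + weight * sample


def keepalive_interval(avg_loss: float, base: float = 1.0, minimum: float = 0.1) -> float:
    """Map average loss to a send interval in seconds (assumed linear mapping).

    Higher loss yields a shorter interval, so that keepalives are sent more
    often when more of them are expected to be lost.
    """
    clamped = min(max(avg_loss, 0.0), 1.0)
    return max(minimum, base * (1.0 - clamped))


avg = 0.0
for sample in (0.0, 0.5, 0.5, 1.0):
    avg = update_average_loss(avg, sample)
    print(f"avg loss {avg:.3f} -> next keepalive in {keepalive_interval(avg):.3f} s")
```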
The present invention differs from this prior art in that the prior invention deals with monitoring for loss of data during network transfer between cluster nodes and determining ways to minimize such loss. The prior invention relates to cluster downtime arising out of network problems. The present invention, however, focuses on issues arising out of operating system, application and cluster software failures.
Yet another prior art method to which the method of the present invention generally relates is detailed in U.S. Pat. No. 5,426,774 entitled METHOD FOR MAINTAINING A SEQUENCE OF EVENTS FUNCTION DURING FAILOVER IN A REDUNDANT MULTIPLE LAYER SYSTEM. The prior art method involves a process control system having a redundant multilayer hierarchical structure, in which each node of a layer is redundant. Sequence of events inputs are received from field devices by an input/output processor (IOP). The IOP is a digital input sequence of events (DISOE) IOP, the IOP being the lowest layer of the hierarchical structure. The IOP interfaces with a controller at the next layer of the hierarchy. This method for reliably maintaining a sequence of events function during a failover of any of the redundant nodes involves the local DISOE maintaining a log in the form of a circular list. The circular list is a rolling log of all sequence of events data for a predefined time period. When a failover occurs, the new primary commands an event recovery. The event recovery process freezes the log and uses the information in the log to recreate the events data. The freeze operation inhibits background-purge activity for the log, thereby avoiding the deletion of information past the defined time. New events data is still entered in the log. Once the log has been processed, the freeze operation is negated. The recreated data is transmitted to the controller in accordance with a predefined protocol, thereby avoiding the loss of any events data as a result of the failover.
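The rolling log with a freeze operation described above can be sketched in a few lines. This is a simplified illustration rather than the cited patent's implementation: the class name, method names, timestamps, and event names are all hypothetical.

```python
from collections import deque


class RollingEventLog:
    """Time-bounded event log whose background purge can be frozen."""

    def __init__(self, retention_seconds: float):
        self.retention = retention_seconds
        self.entries = deque()  # (timestamp, event) pairs, oldest first
        self.frozen = False

    def record(self, timestamp: float, event: str) -> None:
        # New events are accepted even while the log is frozen.
        self.entries.append((timestamp, event))

    def purge(self, now: float) -> None:
        # Background purge: skipped entirely while the log is frozen.
        if self.frozen:
            return
        while self.entries and now - self.entries[0][0] > self.retention:
            self.entries.popleft()

    def freeze(self) -> None:
        self.frozen = True

    def unfreeze(self) -> None:
        self.frozen = False

    def snapshot(self) -> list:
        return [event for _, event in self.entries]


log = RollingEventLog(retention_seconds=10.0)
log.record(0.0, "valve-open")
log.record(8.0, "pump-start")

log.freeze()                 # failover: the new primary commands event recovery
log.purge(now=20.0)          # inhibited, so the expired entries survive
log.record(21.0, "alarm")    # new events data is still entered
recovered = log.snapshot()   # recreate the events data from the log
log.unfreeze()               # freeze is negated once the log is processed

log.purge(now=20.0)          # normal purging resumes
print(recovered)
print(log.snapshot())
```

The first two events are older than the retention window at the time of recovery, yet they remain available because the purge was inhibited; they are discarded only after the freeze is lifted.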
The present invention differs from this prior art in that the prior invention deals with the handling of events data during a failover in a multi-layer process control system. The prior invention, unlike the present invention, does not relate to computer cluster systems that host enterprise software applications for high availability. In addition, the present invention does not deal with application data directly. Instead, the present invention communicates with software applications through MSCS cluster software.
Yet another prior art method to which the method of the present invention generally relates is detailed in U.S. Pat. No. 5,483,637 entitled EXPERT BASED SYSTEM AND METHOD FOR MANAGING ERROR EVENTS IN A LOCAL AREA NETWORK. In this prior art method, an expert based system for managing error events in a local area network (LAN) is described. The system includes an inference engine and a knowledge base storing data defining a plurality of causal relationships. Each of the causal relationships associates an error message with a cause, at least one implied relationship, and at least one trigger relationship. The inference engine accesses the knowledge base in response to a received error message to identify the error message and retrieve from the knowledge base its possible causes. The received error message is compared with other already received error messages to filter out repeated error messages. Already received error messages are examined to determine whether a triggering error message has arrived and, if so, the received error message is discarded. The received error message is compared with existing diagnostic problems, termed a cluster, to determine if the received error message shares common causes with all error messages in the cluster and, if so, the received error message is added to the cluster. The causes in a cluster are evaluated to determine whether one cause in a cluster implies another cause and, if so, the implied cause is discarded. A user interface connected to the inference engine is used for reporting problems including correlated error messages, a cause and a recommended action for fixing the cause.
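Two of the steps summarized above, filtering repeated messages and grouping messages that share common causes, can be sketched as follows. This is a partial illustration only: the knowledge base contents are invented for the example, and the trigger and implied-cause relationships are deliberately left out.

```python
# Hypothetical knowledge base: error message -> set of possible causes.
KNOWLEDGE_BASE = {
    "link-down": {"cable-fault", "switch-failure"},
    "no-carrier": {"cable-fault", "nic-failure"},
    "dup-address": {"misconfiguration"},
}


def correlate(messages):
    """Drop repeated messages, then group the rest by shared possible causes."""
    seen = set()
    clusters = []  # each cluster: {"messages": [...], "causes": set(...)}
    for msg in messages:
        if msg in seen:
            continue  # repeated error message: filtered out
        seen.add(msg)
        causes = KNOWLEDGE_BASE.get(msg, set())
        for cluster in clusters:
            common = cluster["causes"] & causes
            if common:
                # Shares a cause with every message already in the cluster.
                cluster["messages"].append(msg)
                cluster["causes"] = common
                break
        else:
            clusters.append({"messages": [msg], "causes": set(causes)})
    return clusters


result = correlate(["link-down", "link-down", "no-carrier", "dup-address"])
for cluster in result:
    print(cluster["messages"], "->", sorted(cluster["causes"]))
```

Because each cluster keeps only the intersection of its members' causes, a message joins a cluster only when it shares at least one cause with all messages already in it.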
The present invention differs from this prior art in that the prior invention relates to detecting problems or errors in a LAN environment and providing feedback to the user about possible fixes. The likely goal of the prior invention is to help the user troubleshoot LAN errors. The present invention, on the other hand, operates on a Microsoft Windows cluster (based on MSCS), and addresses application availability, as opposed to LAN availability. In addition, the present invention executes actions (including proactive failover) that lessen application downtime.