1. Technical Field
The present invention relates in general to improved high availability cluster management and, in particular, to improved high availability (HA) cluster management during failover. Still more particularly, the present invention relates to managing failover of J2EE compliant middleware in an HA system.
2. Description of the Related Art
For retailers, banks, and other on-line services where load and demand constantly fluctuate and where handling each customer request is of utmost importance, high availability (HA) systems have been developed to handle mission-critical operations. In general, an HA system is a system designed to eliminate or minimize the loss of service due to either planned or unplanned outages among components of a network system. The key method of providing an HA system is through redundant hardware and software components grouped into a cluster of servers.
Redundancy is important in an HA system because when a failure occurs in one node of the cluster, the system transfers the processes performed by that node to another node. In a two-node HA cluster, for example, one node is typically designated as the primary node and the other node is typically designated as the backup node. In general, the primary node initially runs an application when a cluster is started, and the backup node runs the application if the primary node fails. The HA cluster system will typically implement a cluster manager process that periodically polls (or checks the heartbeat of) the primary node to determine if it is still active. If a “heartbeat” is not detected, then the cluster manager moves the software process to another server in the cluster.
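The polling behavior described above can be illustrated with a minimal sketch. It is not the implementation of any particular cluster manager: the names `Node`, `ClusterManager`, and `poll_once` are hypothetical, the `alive` flag stands in for a real heartbeat reply, and the `missed_limit` threshold is an assumption added so that a single missed poll does not trigger failover.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node; `alive` stands in for a real heartbeat reply."""
    name: str
    alive: bool = True
    running: set = field(default_factory=set)


class ClusterManager:
    """Polls the primary node and moves its processes to the backup on failure."""

    def __init__(self, primary: Node, backup: Node, missed_limit: int = 3):
        self.primary = primary
        self.backup = backup
        self.missed_limit = missed_limit  # assumed threshold, not from the text
        self.missed = 0

    def poll_once(self) -> bool:
        """Check the heartbeat once; return True if a failover was performed."""
        if self.primary.alive:
            self.missed = 0
            return False
        self.missed += 1
        if self.missed < self.missed_limit:
            return False
        # Heartbeat not detected: move the processes to the backup node.
        self.backup.running |= self.primary.running
        self.primary.running = set()
        # The former backup now acts as the primary.
        self.primary, self.backup = self.backup, self.primary
        self.missed = 0
        return True
```

In a real deployment `poll_once` would be called on a timer, and the heartbeat check would be a network or serial-link probe rather than a flag on an in-memory object.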
In general, HA systems can be configured in an active-active state or an active-standby state. For a two-node HA cluster in an active-active state, both nodes are active. Thus, the backup node shares some state with the active primary node. For a two-node HA cluster in an active-standby state, the backup node is in standby mode. In standby mode, the components of the backup node must be initialized and brought online when the failover occurs.
An important characteristic of an HA system is the recovery time. In general, the recovery time in an HA system is the time taken for a backup node to take over an application from a failed primary node. Recovery time may be affected by whether an HA system is configured in an active-active state or an active-standby state.
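A rough way to see why the configuration affects recovery time is to model recovery as the failure detection delay plus the startup time of every component that is not already running on the backup node. The sketch below rests on that simplifying assumption, and further assumes that components start one after another. The function name, the component names, and all timing values are illustrative placeholders, not measurements.

```python
def estimate_recovery_seconds(detection_delay, component_start_times, already_running=()):
    """Estimate recovery time under a simple sequential-startup model.

    detection_delay: seconds until the missing heartbeat is noticed.
    component_start_times: mapping of component name to startup seconds.
    already_running: components that are already active on the backup node.
    """
    cold_starts = [
        seconds
        for name, seconds in component_start_times.items()
        if name not in already_running
    ]
    return detection_delay + sum(cold_starts)


# Placeholder startup times in seconds, chosen only for illustration.
start_times = {"database": 10.0, "app_server": 15.0, "web_server": 2.0}

# Active-standby: every component must be initialized at failover.
standby_estimate = estimate_recovery_seconds(3.0, start_times)

# Active-active: all components are already running on the backup node.
active_estimate = estimate_recovery_seconds(3.0, start_times, already_running=start_times.keys())
```

Under this model the active-standby estimate is dominated by component startup, while the active-active estimate reduces to the detection delay alone.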
Recovery time is particularly important in a sales-based HA system because retailers may lose valuable business if a customer is not able to complete transactions quickly. A recovery time of even 30 seconds diminishes a retailer's business transactions.
Another important characteristic of an HA system is to achieve little or no loss of data during failover. In particular, it is important to achieve little or no loss of committed data. For example, it is not advantageous to lose valuable information about a customer order or customer information during failover.
To address the issues of recovery time and loss of data, many developers have developed customized HA software services to control applications in a custom environment, which often requires new hardware. These solutions are often expensive and do not take advantage of open source technologies that allow for portability of applications across multiple platforms.
Alternatively, in an effort to further open source technology and portability across platforms, Java™ 2 Platform, Enterprise Edition (J2EE) provides a reusable component model for use in building web applications. J2EE defines a standard application model, a standard platform for hosting applications, a compatibility requirement, and an operation definition of the J2EE platform. An advantage of this open source model is that multiple developers can implement the J2EE model with additional components and configurations, yet all J2EE applications will run on a J2EE-based system.
Many developers, such as International Business Machines, Corp. (IBM™), have developed software that implements the J2EE model. This software often fills in gaps not specified in the J2EE framework. IBM™, in particular, has developed a middleware stack of J2EE compliant software products that, when implemented on a cluster of servers, support J2EE applications. In general, the middleware stack includes a web server, a database server, and a universal Internet application server. Specifically, this stack may include products such as the IBM DB2™ UDB Enterprise Edition, the IBM HTTP Server, and the IBM WebSphere™ Application Server.
In addition, in an effort to further the impact of open source technology and portability across platforms, Linux provides an inexpensive, platform-independent operating system. Developers of Linux continue to add functions to the operating system that can be implemented in an open source manner by other developers. Some of these functions, such as “heartbeat” and distributed replicated block device (drbd), are implemented with the Linux operating system to assist in configuring HA systems.
In view of the foregoing, it would be advantageous to provide a method, system, and program for implementing an open source based HA system that delivers mission-critical services with minimized recovery time and minimized loss of data. In particular, it would be advantageous to implement an HA system supporting failover of a J2EE compliant middleware stack through an efficient configuration of open source functions, such that as additional components are added to the middleware stack, efficient failover of each component is supported.