1. Technical Field
The present invention relates in general to improved high availability cluster management and, in particular, to remote cluster management of a high availability system. Still more particularly, the present invention relates to improved remote monitoring and management of multiple high availability systems in an enterprise network.
2. Description of the Related Art
For retailers, banks, and other on-line services where load and demand constantly fluctuate and where handling each customer request is of utmost importance, high availability (HA) systems have been developed to handle mission-critical operations. In general, an HA system is a system designed to eliminate or minimize the loss of service due to either planned or unplanned outages among components of a network system. The key method of providing an HA system is through redundant hardware and software components grouped into a cluster of servers.
Redundancy is important in an HA system because when a failure occurs in one node of the cluster, the system transfers the processes performed by that node to another node. In a two-node HA cluster, for example, one node is typically designated as the primary node and the other node is typically designated as the backup node. In general, the primary node initially runs an application when a cluster is started, and the backup node runs the application if the primary node fails. The HA cluster system will typically implement a cluster manager process that periodically polls (or checks the heartbeat of) the primary node to determine if it is still active. If a “heartbeat” is not detected, then the cluster manager moves the software process to another server in the cluster.
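By way of illustration only, the polling behavior described above may be sketched as follows. This is a minimal sketch and not the interface of any particular cluster manager: the names `Node`, `ClusterManager`, `poll_once`, and the `max_missed` threshold are hypothetical, and the heartbeat is simulated by a flag rather than by a network check.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node; `alive` simulates a heartbeat response."""
    name: str
    alive: bool = True
    running: set = field(default_factory=set)  # applications hosted on this node

    def heartbeat(self) -> bool:
        # A real node would answer over the network; here we just read a flag.
        return self.alive


class ClusterManager:
    """Polls the primary node and fails over to the backup when it goes silent."""

    def __init__(self, primary: Node, backup: Node, max_missed: int = 1):
        self.primary = primary
        self.backup = backup
        self.max_missed = max_missed  # assumed tolerance for missed heartbeats
        self.missed = 0

    def poll_once(self) -> bool:
        """Check the primary's heartbeat once; return True if a failover occurred."""
        if self.primary.heartbeat():
            self.missed = 0
            return False
        self.missed += 1
        if self.missed < self.max_missed:
            return False
        # Heartbeat not detected: move the applications to the backup node.
        self.backup.running |= self.primary.running
        self.primary.running = set()
        # The former backup now acts as the primary.
        self.primary, self.backup = self.backup, self.primary
        self.missed = 0
        return True
```

In such a sketch, the polling interval multiplied by `max_missed` gives a lower bound on the time needed to detect a failure, which is one contribution to the recovery time discussed below.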
An important characteristic of an HA system is the recovery time. In general, the recovery time in an HA system is the time taken for a backup node to take over an application from a failed primary node. Recovery time is particularly important in a sales-based HA system because retailers may lose valuable business if a customer is not able to complete transactions quickly. A recovery time of even 30 seconds can diminish a retailer's business transactions.
Another important characteristic of an HA system is the ability to achieve little or no loss of data during failover. In particular, it is important to achieve little or no loss of committed data. For example, it is not advantageous to lose valuable information about a customer order or other customer information during failover.
To achieve a short recovery time and little or no loss of data during failover, it is important to initially combine hardware and software in such a manner that an HA system is built. After an HA system is initiated, however, it is important to monitor and adjust the configuration of the HA system to try to improve the efficiency of failovers and the correction of other errors.
When configuring hardware and software for HA systems, many developers have created customized HA software services to control applications in a custom environment that often requires new hardware. These solutions are often expensive and do not take advantage of open source technologies that allow for portability of applications across multiple platforms. Further, expensive server systems are often selected in the hope that the power available in the server system will automatically increase the efficiency of failovers.
As an alternative, open source developers continue to expand open source technology with functions that can be configured when implementing HA systems. For example, Linux provides an inexpensive, platform-independent operating system. Developers of Linux continue to add functions to the operating system that can be implemented in an open source manner by other developers. Some of these functions, such as “heartbeat” and distributed replicated block device (drbd), are implemented with the Linux operating system to assist in configuring HA systems.
While the Linux tools provide a framework for monitoring for failures and configuring the hardware used in HA systems, there is a need for additional monitoring and configuration capability. In particular, there is a need for a method of monitoring for failures, errors, and other non-ideal conditions in both the hardware and the software of an HA system and for monitoring when the open source HA tools detect failures and errors. Further, there is a need for remotely accumulating the monitored system status and then remotely facilitating reconfiguration of the HA system.
Moreover, typically multiple HA systems are combined in a network to form an enterprise system. Each HA system may service transactional requests for a different store within an enterprise, for example. There is a need for a method, system, and program for remotely accumulating the monitored system status of multiple HA systems within an enterprise, comparing the system status with performance requirements, and tracking hardware and software needs of each HA system within the enterprise.
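For illustration only, comparing accumulated system status against performance requirements might be sketched as follows. The record fields, the `Requirements` thresholds, and the function name `find_violations` are hypothetical and chosen solely for this example; they do not describe any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusReport:
    """Hypothetical status record accumulated from one HA system."""
    system_id: str
    recovery_seconds: float   # most recent measured recovery time
    failed_components: int    # count of components currently reporting errors


@dataclass(frozen=True)
class Requirements:
    """Hypothetical enterprise-wide performance requirements."""
    max_recovery_seconds: float
    max_failed_components: int


def find_violations(reports, req):
    """Return {system_id: [unmet requirement, ...]} for systems that fall short."""
    violations = {}
    for report in reports:
        problems = []
        if report.recovery_seconds > req.max_recovery_seconds:
            problems.append("recovery time")
        if report.failed_components > req.max_failed_components:
            problems.append("failed components")
        if problems:
            violations[report.system_id] = problems
    return violations
```

A remote console could, under these assumptions, run such a comparison over the reports gathered from each store's HA system and flag the systems that need hardware or software attention.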
Further, when implementing an HA system using an open source operating system framework, it would be advantageous to implement an open source compliant middleware layer to handle transaction requests. In particular, it would be advantageous to implement a Java™ 2 Platform, Enterprise Edition (J2EE) compliant middleware stack that is: (1) controlled by open source-based cluster management interfacing with a remote enterprise console; and (2) able to be monitored and configured, along with multiple other HA systems in an enterprise network, through that console.