1. Field of the Invention
This invention generally relates to the field of cluster multiprocessing, and more specifically to monitoring of cluster computers for availability.
2. Description of Related Art
Clustering servers enables parallel access to data, which can help provide the redundancy and fault resilience required for business-critical applications. Clustering applications, such as High Availability Cluster Multi-Processing (HACMP) provided by International Business Machines (IBM) of Armonk, N.Y., provide tools to help install, configure and manage clusters in a highly productive manner. HACMP provides monitoring and recovery of clustered computer resources for use in providing data access and backup functions (e.g., a mission critical database). HACMP also enables server clusters to be configured for application recovery/restart to provide protection for business-critical applications through redundancy.
Cluster monitoring applications, such as Reliable Scalable Cluster Technology (RSCT) provided by IBM, provide error detection for TCP/IP based computer networks. RSCT is a clustering infrastructure that can be used by HACMP for providing higher-level recovery functions. RSCT sends messages, known as heartbeat messages, across each network interface connected to the network. When heartbeat messages are no longer received via a particular network interface, that network interface is considered dead or unconnected. The heartbeat technology requires that the RSCT software be able to direct heartbeat messages through a specific network interface. For this purpose, the IP address for each network interface must meet certain requirements so that the IP layer of the operating system will always direct the heartbeat message to the desired network interface. One of the requirements of the heartbeat technology is that each network interface on a node must be on a different subnet than all other network interfaces on that node.
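By way of illustration only, the distinct-subnet requirement described above can be expressed as a simple check. The following is a minimal sketch using Python's standard `ipaddress` module; the function name, the interface names and the addresses are hypothetical and are not part of RSCT or HACMP.

```python
import ipaddress
from collections import defaultdict


def find_subnet_conflicts(interfaces):
    """Return the subnets that are shared by more than one interface on a node.

    `interfaces` maps a (hypothetical) interface name to its address in
    CIDR form, e.g. {"en0": "10.1.1.5/24"}.
    """
    by_subnet = defaultdict(list)
    for name, cidr in interfaces.items():
        # ip_interface() keeps the host address and derives the enclosing network.
        network = ipaddress.ip_interface(cidr).network
        by_subnet[network].append(name)
    # Only subnets holding two or more interfaces violate the requirement.
    return {str(net): sorted(names)
            for net, names in by_subnet.items() if len(names) > 1}


# Each interface on its own subnet: meets the heartbeat requirement.
node_ok = {"en0": "10.1.1.5/24", "en1": "10.1.2.5/24"}
# Both interfaces on 10.1.1.0/24: violates the heartbeat requirement.
node_bad = {"en0": "10.1.1.5/24", "en1": "10.1.1.6/24"}

print(find_subnet_conflicts(node_ok))
print(find_subnet_conflicts(node_bad))
```

An empty result indicates that every interface on the node is on its own subnet; any entry in the result identifies interfaces that the heartbeat function could not monitor individually.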
More specifically, when a message is sent to an address, the IP layer routes that message to a network interface based on the destination address and the configuration of that network interface. This is known as “subnet routing,” and the addresses involved must be organized in the proper subnets so that the routing layer directs the message to that specific network interface. Otherwise, if two addresses on the same node are in the same subnet, the routing function can send the messages across either of the network interfaces (e.g., by always using one interface or the other, or by alternating between the interfaces), so the heartbeat function cannot monitor the individual network interfaces.
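The ambiguity described above can be illustrated with a simplified model of subnet routing. The sketch below is an assumption-laden approximation: it merely lists the interfaces whose subnet contains the destination address, whereas an actual IP layer also applies longest-prefix matching, route metrics and other policy. All names and addresses are hypothetical.

```python
import ipaddress


def candidate_interfaces(destination, interfaces):
    """Return the names of the interfaces whose subnet contains `destination`.

    `interfaces` maps a (hypothetical) interface name to its address in
    CIDR form. More than one name in the result means the routing layer
    is free to choose among them.
    """
    dest = ipaddress.ip_address(destination)
    return sorted(
        name
        for name, cidr in interfaces.items()
        if dest in ipaddress.ip_interface(cidr).network
    )


# Distinct subnets: exactly one interface can carry the heartbeat message.
distinct = {"en0": "10.1.1.5/24", "en1": "10.1.2.5/24"}
print(candidate_interfaces("10.1.2.9", distinct))

# Shared subnet: either interface may carry the message, so a failure of
# one interface cannot be distinguished from the other.
shared = {"en0": "10.1.1.5/24", "en1": "10.1.1.6/24"}
print(candidate_interfaces("10.1.1.9", shared))
```

When the result contains a single interface, a heartbeat message addressed to that destination is known to traverse that interface; when it contains several, the path taken is left to the routing function.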
This address requirement creates a difficulty when complex or large networks are involved. For example, with 8 or more network interfaces per node, as is now common, a correspondingly large number of subnets must be supplied. Manually assigning the proper subnet ranges and maintaining the addresses is non-trivial, especially when the network is changed or undergoes maintenance. Further, the requirement exists only so that the RSCT software can accurately determine whether each individual network interface is functioning.
HACMP software uses RSCT for monitoring network interfaces and provides “high availability” of network addresses by moving network addresses between network interfaces in response to failures. The process of moving the network address to a backup network interface is known as “recovery”. Users of HACMP must provide certain information about the network to HACMP, such as a list of the network interfaces connected to the network and corresponding network addresses, such that HACMP can properly perform the recovery function.
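The recovery function described above may be sketched, purely for illustration, as a reassignment of addresses from a failed network interface to a backup network interface. This sketch models only the bookkeeping; it does not configure any real interface, and the function name, data layout and addresses are hypothetical rather than taken from HACMP.

```python
def recover_address(assignments, failed, backups):
    """Move every address on a failed interface to the first usable backup.

    assignments: dict mapping interface name -> list of network addresses
    failed:      name of the interface considered dead or unconnected
    backups:     ordered list of (interface name, is_up) pairs
    Returns a new assignments dict; the input is left unmodified.
    """
    for name, is_up in backups:
        if is_up and name != failed:
            # Copy so that the caller's view of the network is not mutated.
            updated = {k: list(v) for k, v in assignments.items()}
            moved = updated.pop(failed, [])
            updated.setdefault(name, []).extend(moved)
            updated[failed] = []  # the failed interface no longer holds addresses
            return updated
    raise RuntimeError("no backup network interface is available for recovery")


before = {"en0": ["192.168.10.10"], "en1": []}
after = recover_address(before, failed="en0", backups=[("en1", True)])
print(after)
```

The list of interfaces and addresses supplied to such a function corresponds to the network information that the user must provide so that recovery can be performed.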
Currently, HACMP passes the above network information to RSCT for use in the heartbeat function. A drawback with this configuration is that, in order to properly perform the heartbeat function, the network addresses and network interfaces must conform to certain rules, such as requiring each network interface address in each node to be located on a separate subnet. These rules, however, are not necessarily required in the context of the recovery function. Nevertheless, because the rules are necessary to the heartbeat function, the user must define network interface addresses for recovery such that they also meet the requirements of the heartbeat function.
Therefore, a need exists to overcome the problems discussed above, and particularly for a way to more efficiently monitor the availability of computers in a cluster.