The present invention relates to network communications and more particularly to network communications to a cluster of data processing systems.
The Internet Protocol (IP) is a connectionless protocol. IP packets are routed from originator through a network of routers to the destination. All physical adapter devices in such a network, including those for client and server hosts, are identified by an IP Address which is unique within the network. One valuable feature of IP is that a failure of an intermediate router node or adapter will not prevent a packet from moving from source to destination, as long as there is an alternate path through the network.
In Transmission Control Protocol/Internet Protocol (TCP/IP), TCP sets up a connection between two endpoints, identified by the respective IP addresses and a port number on each. Unlike failures of an adapter in an intermediate node, if one of the endpoint adapters (or the link leading to it) fails, all connections through that adapter fail, and must be reestablished. If the failure is on a client workstation host, only the relatively few client connections are disrupted, and usually only one person is inconvenienced. However, an adapter failure on a server means that hundreds or thousands of connections may be disrupted. On an S/390 with large capacity, the number may run to tens of thousands.
To alleviate this situation, International Business Machines Corporation introduced the concept of a Virtual IP Address, or VIPA, on its TCP/IP for OS/390 V2R5 (and added to V2R4 as well). A VIPA is configured the same as a normal IP address for a physical adapter, except that it is not associated with any particular device. To an attached router, the TCP stack on OS/390 simply looks like another router. When the TCP stack receives a packet destined for one of its VIPAs, the inbound IP function of the TCP stack notes that the IP address of the packet is in the TCP stack's Home list of IP addresses and forwards the packet up the TCP stack. The "home list" of a TCP stack is the list of IP addresses which are "owned" by the TCP stack. Assuming the TCP stack has multiple adapters or paths to it (including a Cross Coupling Facility (XCF) path from other TCP stacks in a Sysplex), if a particular physical adapter fails, the attached routing network will route VIPA-targeted packets to the TCP stack via an alternate route. The VIPA may, thus, be thought of as an address to the stack, and not to any particular adapter.
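The home list check described above can be illustrated with a minimal sketch. The function name `deliver_locally` and the sample addresses are assumptions made for illustration only; they are not taken from TCP/IP for OS/390.

```python
from ipaddress import ip_address

def deliver_locally(dest, home_list):
    """Return True if the destination address is owned by this stack.

    A VIPA appears in the home list like any adapter address, even though
    it is not tied to a particular device.
    """
    return ip_address(dest) in home_list

# Hypothetical home list: one physical adapter address and one VIPA.
home_list = {ip_address("192.0.2.10"), ip_address("10.1.1.1")}

vipa_packet_accepted = deliver_locally("10.1.1.1", home_list)
foreign_packet_accepted = deliver_locally("10.9.9.9", home_list)
```

In this sketch a packet whose destination is not in the home list is simply not delivered locally; what the stack does with it instead (for example, forwarding) is outside the scope of the example.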
While the use of VIPAs may remove hardware and associated transmission media as a single point of failure for large numbers of connections, the connectivity of a server can still be lost through a failure of a single stack or an MVS image. The VIPA Configuration manual for OS/390 tells the customer how to configure the VIPA(s) for a failed stack on another stack, but this is a manual process. Substantial down time of a failed MVS image or TCP stack may still result until operator intervention to manually reconfigure the TCP stacks in a Sysplex to route around the failed TCP stack or MVS image.
While merely restarting an application with a new IP address may resolve many failures, applications use IP addresses in different ways and, therefore, such a solution may be inappropriate. The first time a client resolves a name in its local domain, the local Domain Name Server (DNS) will query back through the DNS hierarchy to get to the authoritative server. For a Sysplex, the authoritative server should be DNS/Workload Manager (WLM). DNS/WLM will consider relative workloads among the nodes supporting the requested application, and will return the IP address for the most appropriate available server. IP addresses for servers that are not available will not be returned. The Time to Live of the returned IP address will be zero, so that the next resolution query (on failure of the original server, for example) will go all the way back to the DNS/WLM that has the knowledge to return the IP address of an available server.
However, in practice, things do not always work as described above. For example, some clients are configured to a specific IP address, thus requiring human intervention to go to another server. However, the person using the client may not have the knowledge to reconfigure the client for a new IP address. Additionally, some clients ignore the Time to Live, and cache the IP address as long as the client is active. Human intervention may again be required to recycle the client to obtain a new IP address. Also, DNSs are often deployed as a hierarchy to reduce network traffic, and DNSs may cache the IP address beyond the stated Time to Live even when the client behaves quite correctly. Thus, even if the client requests a new IP address, the client may receive the cached address from the DNS. Finally, some users may prefer to configure DNS/WLM to send a Time to Live that is greater than zero, in an attempt to limit network-wide traffic to resolve names. Problems arising from these various scenarios may be reduced if the IP address with which the client communicates does not change. However, as described above, to effect such a movement of VIPAs between TCP stacks requires operator intervention and may result in lengthy down times for the applications associated with the VIPA.
Previous approaches to increased availability focused on providing spare hardware. The High-Availability Coupled Multi-Processor (HACMP) design allows for taking over the MAC address of a failing adapter on a shared medium (LAN). This works both for a failing adapter (failover to a spare adapter on the same node) and for a failing node (failover to another node via a spare adapter or adapters on the takeover node). Spare adapters are not used for IP traffic, but they are used to exchange heartbeats among cluster nodes for failure detection. All of the work on a failing node goes to a single surviving node. In addition to spare adapters, and of course access to the same application data, the designated failover node must also have sufficient spare processing capacity to handle the entire failing node workload with "acceptable" service characteristics (response and throughput).
Automatic restart of failing applications also provides faster recovery of a failing application or node. This may be acceptable when the application can be restarted in place, but is less useful when the application is moved to another node, unless the IP address known to the clients can be moved with the application, or dynamic DNS updates with alternate IP addresses can be propagated to a DNS local to clients sufficiently quickly.
Other attempts at error recovery have included the EDDIE system described in a paper titled "EDDIE, A Robust and Scalable Internet Server" by A. Dahlin, M. Froberg, J. Grebeno, J. Walerud, and P. Winroth, of Ericsson Telecom AB, Stockholm, Sweden, May 1998. In the EDDIE approach a distributed application called "IP Address Migration Application" controls all IP addresses in the cluster. The cluster is connected via a shared-medium LAN. IP address aliasing is used to provide addresses to individual applications over a single adapter, and these aliases are located via Address Resolution Protocol (ARP) and ARP caches in the TCP/IPs. The application monitors all server applications and hardware, and reallocates aliased IP addresses in the event of failure to surviving adapters and nodes. This approach allows applications of a failing node to be distributed among surviving nodes, but it may require the monitoring application to have complete knowledge of the application and network adapter topology in the cluster. In this sense, it is similar to existing Systems Management applications such as those provided by International Business Machines Corporation's Tivoli® network management software, but the IP Address Migration Application has direct access to adapters and ARP caches. The application also requires a dedicated IP address for inter-application communication and coordination.
In light of the above discussion, a need exists for improvements in recovery or movement of IP addresses within a cluster of data processing systems.
In view of the above discussion, it is an object of the present invention to provide for automatic recovery from failures of a communication protocol stack in a cluster of computers.
A further object of the present invention is to allow for recovery from protocol stack failures without requiring clients or servers connected to the failed protocol stacks to update cached addresses associated with the failed stack.
Still another object of the present invention is to allow automatic recovery from a protocol stack failure without requiring dedicated backup hardware.
Another object of the present invention is to provide for recovery from a protocol stack failure without reservation of capacity on a recovering system.
Yet another object of the present invention is to allow automatic recovery from a protocol stack failure without requiring additional monitoring applications or additional addresses dedicated to a recovery mechanism.
A still further object of the present invention is to allow automatic recovery from a protocol stack failure without requiring a single backup to be capable of handling the entire workload of the failed protocol stack.
These and other objects of the present invention may be provided by methods, systems, and computer program products which provide dynamic Virtual IP Addresses (VIPAs). Dynamic VIPAs are VIPAs which may be automatically moved from one protocol stack to another and are, thus, associated with an application or application group (recoverable entity) as opposed to a network adapter or host/system.
Thus, the present invention may provide for transferring a Virtual IP Address (VIPA) from a first application instance to a second application instance, where the first application instance and the second application instance are executing on a cluster of data processing systems having a plurality of communication protocol stacks associated therewith and where the first application instance is associated with a first of the plurality of communication protocol stacks and the second application instance is associated with a second of the plurality of communication protocol stacks by distributing among the plurality of communication protocol stacks a list of dynamic VIPAs. A hierarchy of backup communication protocol stacks for the dynamic VIPAs is determined based on the list of dynamic VIPAs. Upon receiving notification of failure of the first communication protocol stack the second communication protocol stack evaluates the hierarchy of backup communication protocol stacks to determine if it is the next communication protocol stack in the hierarchy of backup communication protocol stacks for the VIPA associated with the first application instance. If so, then the VIPA associated with the first application instance is transferred to the second communication protocol stack associated with the second application instance.
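The takeover decision described above may be sketched as follows. This is an illustrative sketch only: the function names, the stack names, and the representation of the hierarchy as an ordered list per VIPA are assumptions, not details given in this description.

```python
def next_backup(hierarchy, vipa, failed_stacks):
    """Return the first stack in the VIPA's backup order that has not failed."""
    for stack in hierarchy.get(vipa, []):
        if stack not in failed_stacks:
            return stack
    return None

def should_take_over(my_name, hierarchy, vipa, failed_stacks):
    """Each surviving stack runs this check locally after a failure notice."""
    return next_backup(hierarchy, vipa, failed_stacks) == my_name

# Hypothetical cluster: STACK_A owned the VIPA and has failed.
# The backup order for the VIPA is STACK_B, then STACK_C.
hierarchy = {"10.1.1.1": ["STACK_B", "STACK_C"]}
failed = {"STACK_A"}

takeover_b = should_take_over("STACK_B", hierarchy, "10.1.1.1", failed)
takeover_c = should_take_over("STACK_C", hierarchy, "10.1.1.1", failed)
```

Because every stack evaluates the same distributed list, each can reach the same conclusion independently; in this sketch only STACK_B concludes that it should activate the VIPA.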
In a further embodiment of the present invention where a specific address is associated with an application instance, transfer of such specific dynamic VIPAs may be accomplished by establishing a range of dynamic VIPAs associated with the communication protocol stacks which are valid specific dynamic VIPAs. Information as to the bind status of dynamic VIPAs associated with the communication protocol stacks is then distributed among the communication protocol stacks to allow the communication protocol stacks to determine if a dynamic VIPA has been bound to an application instance. Upon notification of failure of a communication protocol stack the bind status of dynamic VIPAs bound to the failed communication protocol stack is reset so as to allow bind calls to the dynamic VIPAs bound to the failed communication protocol stack. When a request to bind the second instance of the application with the first VIPA of the first application instance is received at another communication protocol stack, the second application instance is bound to the first VIPA at the second communication protocol stack so as to allow communications to the second application instance through the second communication protocol stack utilizing the first VIPA.
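The range check and bind status handling described above might look like the following sketch. The class name `BindTable`, its methods, and the sample range are hypothetical; the description does not specify how bind status is stored or exchanged.

```python
from ipaddress import ip_address, ip_network

class BindTable:
    """Cluster-wide view of which stack has each specific dynamic VIPA bound."""

    def __init__(self, vipa_range):
        self.range = ip_network(vipa_range)  # valid specific dynamic VIPAs
        self.bound = {}                      # address -> name of binding stack

    def bind(self, vipa, stack):
        """Accept the bind if the VIPA is in range and not bound elsewhere."""
        addr = ip_address(vipa)
        if addr not in self.range:
            return False
        owner = self.bound.get(addr)
        if owner is not None and owner != stack:
            return False
        self.bound[addr] = stack
        return True

    def stack_failed(self, stack):
        """Reset bind status for every VIPA bound on the failed stack."""
        self.bound = {a: s for a, s in self.bound.items() if s != stack}

table = BindTable("10.2.0.0/24")                   # hypothetical range
first = table.bind("10.2.0.5", "STACK_A")          # first instance binds
blocked = table.bind("10.2.0.5", "STACK_B")        # still bound on STACK_A
table.stack_failed("STACK_A")                      # failure notification
moved = table.bind("10.2.0.5", "STACK_B")          # second instance binds
out_of_range = table.bind("10.3.0.1", "STACK_B")   # not a valid dynamic VIPA
```

The second application instance keeps the same address after the failure, so clients that cached it need not resolve a new one.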
By communicating between the protocol stacks, the hierarchy of backups may be established so that the protocol stacks themselves may handle takeover of dynamic VIPAs for failed stacks or operating system images. Furthermore, this recovery may be automatic as the protocol stacks are notified of the failure of another protocol stack in the cluster and have available sufficient information to know which protocol stack should take over a dynamic VIPA of a failed protocol stack. Thus, the automatic recovery may be accomplished without the need for a global monitoring application to control the transfer of addresses as the transfer is controlled by the protocol stacks themselves. Also, because the recovery utilizes VIPAs, clients or servers connected to the failed protocol stacks may not need to update cached addresses associated with the failed stack. Because the dynamic VIPAs associated with a failed stack are transferred to other protocol stacks in the cluster, there may be no need to provide dedicated backup hardware.
In particular embodiments of the present invention, the list of dynamic VIPAs is distributed by broadcasting, from communication protocol stacks supporting dynamic VIPAs, a VIPA define message which provides identification of active dynamic VIPAs associated with the broadcasting communication protocol stack and which provides identification of dynamic VIPAs for which the broadcasting communication protocol stack is a backup communication protocol stack. Furthermore, the VIPA define message may include a priority associated with the dynamic VIPAs for which the broadcasting communication protocol stack is a backup communication protocol stack.
When a communication protocol stack receives a broadcast VIPA define message, the communication protocol stack may establish, for each VIPA in the VIPA define message, a list of communication protocol stacks associated with the VIPA so as to provide a list of the hierarchy of backup communication protocol stacks associated with the VIPA. Preferably, the list of communication protocol stacks is in the order of priority of the communication protocol stacks.
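One way a receiving stack could maintain such a priority-ordered list is sketched below. The function name, the message layout (a mapping from VIPA to priority), and the convention that a larger number means a higher priority are all assumptions made for illustration.

```python
from collections import defaultdict

def apply_vipa_define(backups, sender, backup_vipas):
    """Record `sender` as a backup for each VIPA it announces.

    backups:      dict mapping VIPA -> list of (priority, stack) tuples
    backup_vipas: dict mapping VIPA -> priority claimed by `sender`
    Assumption: a larger number means a higher priority.
    """
    for vipa, priority in backup_vipas.items():
        # Drop any earlier entry from the same sender, then re-insert it.
        entries = [e for e in backups[vipa] if e[1] != sender]
        entries.append((priority, sender))
        entries.sort(key=lambda e: -e[0])   # highest priority first
        backups[vipa] = entries

backups = defaultdict(list)
apply_vipa_define(backups, "STACK_B", {"10.1.1.1": 100})
apply_vipa_define(backups, "STACK_C", {"10.1.1.1": 200, "10.1.1.2": 50})
apply_vipa_define(backups, "STACK_B", {"10.1.1.1": 100})  # repeated broadcast

order = [name for _, name in backups["10.1.1.1"]]
```

Replacing a sender's earlier entry makes a repeated broadcast harmless, so the list stays the same no matter how often a stack re-announces itself.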
In a still further embodiment of the present invention, if a received broadcast VIPA define message defines a communication protocol stack as a primary communication protocol stack for a dynamic VIPA identified as an active dynamic VIPA at the communication protocol stack receiving the VIPA define message, then the active dynamic VIPA may be deleted at the receiving communication protocol stack. Furthermore, the deletion of the active dynamic VIPA may be delayed if an active connection exists to the active dynamic VIPA. If so, then a VIPA delayed giveback message may be sent so as to notify other communication protocol stacks that the communication protocol stack which received the VIPA define message will delete the active dynamic VIPA when there are no connections to the active dynamic VIPA. Also, the communication protocol stack which transmitted the VIPA define message, responsive to receiving the VIPA delayed giveback message, may then broadcast a VIPA define message identifying the communication protocol stack which transmitted the VIPA define message as a backup protocol stack for the active dynamic VIPA having a highest priority.
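The delayed giveback behavior described above can be sketched as a small state holder. The class, its attributes, and the returned strings are hypothetical labels chosen for this example; only the notion of a VIPA delayed giveback message comes from the description.

```python
class Stack:
    """Minimal model of a stack that is currently serving a dynamic VIPA."""

    def __init__(self, name):
        self.name = name
        self.active = set()            # dynamic VIPAs active on this stack
        self.connections = {}          # VIPA -> number of open connections
        self.pending_giveback = set()  # VIPAs to delete once idle

    def on_primary_define(self, vipa):
        """Another stack has announced itself as primary for `vipa`."""
        if vipa not in self.active:
            return None
        if self.connections.get(vipa, 0) > 0:
            # Keep serving existing connections; tell the others we will
            # give the address back later.
            self.pending_giveback.add(vipa)
            return "VIPA_DELAYED_GIVEBACK"
        self.active.discard(vipa)
        return "VIPA_DELETED"

    def on_connection_closed(self, vipa):
        """Delete a pending VIPA once its last connection has closed."""
        remaining = max(self.connections.get(vipa, 0) - 1, 0)
        self.connections[vipa] = remaining
        if remaining == 0 and vipa in self.pending_giveback:
            self.pending_giveback.discard(vipa)
            self.active.discard(vipa)
            return "VIPA_DELETED"
        return None

b = Stack("STACK_B")
b.active.add("10.1.1.1")
b.connections["10.1.1.1"] = 2

first = b.on_primary_define("10.1.1.1")       # two connections still open
second = b.on_connection_closed("10.1.1.1")   # one connection remains
third = b.on_connection_closed("10.1.1.1")    # last connection closed
```

Delaying the deletion avoids breaking connections that were established while the backup held the address.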
In yet another embodiment of the present invention, notification of the restart of the first communication protocol stack may be provided to other of the plurality of communication protocol stacks. If the first communication protocol stack is the primary communication protocol stack associated with an active dynamic VIPA of the second communication protocol stack, then the active dynamic VIPA is deleted at the second communication protocol stack. The second communication protocol stack, responsive to deleting the active dynamic VIPA, may then broadcast a VIPA define message identifying the second communication protocol stack as a backup protocol stack for the active dynamic VIPA. As described above, the deletion of the active dynamic VIPA may be delayed if a connection exists to the active dynamic VIPA.
In a particular preferred embodiment of the present invention, dynamic VIPAs in the range of dynamic VIPAs and dynamic VIPAs in the list of dynamic VIPAs are mutually exclusive.
In another embodiment of the present invention, the request to bind the second application to the first VIPA at the second of the plurality of communication protocol stacks is a BIND call which specifies the first VIPA.
Furthermore, where a request is made for a specific dynamic VIPA, it may be determined if the requested VIPA is active on a communication protocol stack other than the communication protocol stack receiving the request. If so, the bind request may be rejected. If the bind call is not rejected, however, the communication protocol stacks may be notified that the requested dynamic VIPA is active at a communication protocol stack.
In yet another embodiment of the present invention, the request to bind an application to a specific dynamic VIPA is an IOCTL command which specifies the specific dynamic VIPA. In such a case, the dynamic VIPA may be activated on the communication protocol stack irrespective of whether the dynamic VIPA is active on another communication protocol stack. The dynamic VIPA may then be deleted from all communication protocol stacks other than the communication protocol stack on which it was activated.
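The difference between the BIND and IOCTL forms of the request can be contrasted in a short sketch. The handler names and the `active_on` mapping are assumptions for illustration; they do not correspond to actual socket or IOCTL interfaces.

```python
def handle_bind(active_on, vipa, stack):
    """BIND-style request: rejected if the VIPA is active on another stack."""
    owner = active_on.get(vipa)
    if owner is not None and owner != stack:
        return False
    active_on[vipa] = stack
    return True

def handle_ioctl(active_on, vipa, stack):
    """IOCTL-style request: activate unconditionally.

    Returns the stack that must now delete the VIPA, or None if no other
    stack had it active.
    """
    previous = active_on.get(vipa)
    active_on[vipa] = stack
    return previous if previous != stack else None

# Hypothetical state: the VIPA is currently active on STACK_A.
active_on = {"10.2.0.5": "STACK_A"}

bind_ok = handle_bind(active_on, "10.2.0.5", "STACK_B")
displaced = handle_ioctl(active_on, "10.2.0.5", "STACK_B")
```

Here the BIND-style request from STACK_B is rejected, while the IOCTL-style request succeeds and identifies STACK_A as the stack from which the address should be deleted.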
As will further be appreciated by those of skill in the art, the present invention may be embodied as methods, apparatus/systems and/or computer program products.