1. Field of the Invention
The present invention relates to a method and system for switching between duplicated network interface adapters attached to a host computer to provide fault-resilient network communication functions. More particularly, the present invention relates to a method and system for automatically switching between duplicated network interface adapters to achieve quick recovery from failure without disrupting ongoing host-to-host communications based on the Transmission Control Protocol/Internet Protocol (TCP/IP).
2. Description of the Related Art
The TCP/IP protocols are widely used in today's computer communications, from local area networks to world-wide scale networks. Some conventional network systems use TCP/IP to exchange data between a plurality of host computers. Each host is equipped with a network interface adapter (or network adapter for short) as one of its peripherals, coupled through a local channel or system bus, which serves as a hardware and software interface to provide the functionality of the TCP/IP protocol stack. Where high availability of communication functions is essential, as in server systems, hardware duplication and backup techniques are introduced into the host design. More specifically, a host is equipped with two network adapters, one for main use and the other for backup, to provide the immediate readiness of an alternative functional unit to which operations can be automatically switched when the main adapter has failed. This structural arrangement is sometimes called a hot standby system, which greatly reduces unwanted machine downtime.
FIG. 7 is a diagram which shows the configuration of a conventional host-to-host communication system. A host 10 is connected to a local area network (LAN) 100 through duplex network adapters 11 and 12. Although both network adapters 11 and 12 are working concurrently as independent network entities, only one network adapter (e.g., network adapter 11) is allocated as an active resource for the host 10. The other network adapter 12 normally stays in standby mode, while monitoring the network activities. In addition to the host 10, another host 20 is connected to the LAN 100 through a network adapter 21. The LAN 100 is further linked to another LAN 200 via a router 30, and many computers 210, 220, etc. are connected to the LAN 200.
Those network nodes exchange data over the network by using the TCP/IP protocols. To this end, the devices are provided with different addresses so that they can be uniquely identified. For instance, different IP addresses "A" and "B" are assigned to the hosts 10 and 20, respectively. IP addresses are sometimes called "logical addresses."
While IP addresses are defined for use in the internet layer protocols as an integral part of the TCP/IP protocol suite, the lower layer protocol (i.e., datalink) requires another kind of address information to deliver packets within a local portion of a network, or a subnet. Such network addresses are referred to as "MAC addresses," which are sometimes called "physical addresses" as opposed to the logical addresses. MAC addresses are assigned to individual network adapters, which provide capabilities of lower-layer communication control. In the system of FIG. 7, for instance, MAC addresses "a" and "c" are given to the network adapters 11 and 12 of the host 10. Similarly, a MAC address "b" is assigned to the network adapter 21 of the host 20. The association between IP addresses and MAC addresses can be acquired by using the Address Resolution Protocol (ARP), as will be described later on. The hosts or network adapters individually maintain a list of IP addresses and the MAC addresses that correspond to them by monitoring ARP messages. This list is called a "network table," or "ARP cache."
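As a rough illustration of the network table described above, the following Python sketch models it as a mapping from IP address to MAC address. The class name `NetworkTable` and its methods are hypothetical names chosen for this example; they do not come from the specification.

```python
class NetworkTable:
    """Illustrative model of a network table (ARP cache) kept by one owner."""

    def __init__(self, owner_mac):
        self.owner_mac = owner_mac  # MAC address of the table's owner
        self.entries = {}           # IP address -> MAC address

    def register(self, mac, ip):
        """Record (or overwrite) the MAC address associated with an IP address."""
        self.entries[ip] = mac

    def lookup(self, ip):
        """Return the MAC address for ip, or None if it is not yet known."""
        return self.entries.get(ip)


# Table owned by the adapter with MAC address "c" (network adapter 12).
table = NetworkTable("c")
table.register("a", "A")   # host 10's IP "A" is reachable at MAC "a"
print(table.lookup("A"))   # a
print(table.lookup("B"))   # None
```

A lookup that returns `None` corresponds to the situation in which an ARP request must be issued before a message can be sent.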
Generally, in TCP/IP communications, a failure that has happened in a router device on a certain connection path between a specific source and destination nodes will initiate a process of finding an alternate route to reach the same destination after expiration of a predetermined time. Therefore, the hosts can continue communication by using another router available, through an alternative connection path that has been found.
Likewise, when a fault is detected in the main network adapter of a certain host, the operations will be automatically switched to another network adapter prepared for backup. However, conventional systems have the problem that they are unable to switch from the main adapter to the backup instantly. The switchover actually takes a long time, twenty minutes for example. The rest of this section will discuss this problem with conventional systems in detail.
Referring now to FIG. 8, the operation of the conventional communication system will be explained below. FIG. 8 is a sequence diagram which shows how to switch between duplicated network adapters in the conventional network system of FIG. 7. Hosts 10 and 20 communicate according to the following steps.
(S71) The host 10 requests the network adapter 11 to perform a data transmission to the host 20. More specifically, the following data message is passed to the network adapter 11.
"(B) (A) (DATA)"
Here, the first field "(B)" is a destination IP address, and the second field "(A)" is a source IP address. The third field is a payload, or the main body of the data message addressed to the host 20. Note here that the message format conventions used in this specification are much simplified for illustrative purposes, while the message structure in actual implementations should conform to the standard protocol specifications for TCP/IP.
For reference, the messages and the table entry cited in the steps that follow are, in order of appearance:
Step S72: "(BCAST) (a) (B) (A)"
Step S73: (c|a:A)
Step S77: "(b) (a) (B) (A) (DATA)"
Step S82: "(b) (c) (B) (A) (DATA)"
Step S83: "(a) (b) (A) (B) (DATA)"
Step S87: "(BCAST) (c) (B) (A)"
Step S91: "(b) (c) (B) (A) (DATA)"
(S72) If the destination MAC address "b" were known, the network adapter 11 could readily transmit the data message over the network after adding the MAC addresses to each packet header. At this initial stage, however, the network adapter 11 is unable to send the message at once because the MAC address corresponding to the destination IP address "B" is not registered in a network table 11a of the network adapter 11. Accordingly, the network adapter 11 suspends the transmission of that message for a while and, instead, transmits an ARP request message containing the following information to all nodes on the network to investigate the MAC address of the destination.
"(BCAST) (a) (B) (A)"
As mentioned earlier, ARP is the acronym of Address Resolution Protocol, which is used to ask every node on the network what the MAC address for a particular IP address is. Here, the first field "(BCAST)" represents a code used to call up all MAC addresses simultaneously, allowing the network adapter 11 to broadcast the same message toward every node on the network. The second field "(a)" indicates the source MAC address, i.e., the MAC address of the network adapter 11 that is currently activated. The third and fourth fields "(B)" and "(A)" represent the destination IP address and source IP address, respectively.
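The four fields of the simplified ARP request can be sketched in Python as follows. This only mirrors the illustrative notation of this specification, not the actual ARP packet layout; the helper names `make_arp_request` and `format_message` are assumptions made for this example.

```python
BCAST = "BCAST"  # stands in for the code that addresses all MAC addresses at once


def make_arp_request(src_mac, dst_ip, src_ip):
    """Return the fields of a simplified ARP request in the order used in the text."""
    return (BCAST, src_mac, dst_ip, src_ip)


def format_message(fields):
    """Render a tuple of fields in the parenthesized notation of the text."""
    return " ".join(f"({field})" for field in fields)


# Network adapter 11 (MAC "a") asks for the MAC address of IP "B" on behalf of IP "A".
request = make_arp_request(src_mac="a", dst_ip="B", src_ip="A")
print(format_message(request))  # (BCAST) (a) (B) (A)
```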
(S73) While in a standby state (i.e., not serving the host 10), the backup network adapter 12 receives ARP broadcast messages and acquires information about network address mapping, if any. The network adapter 12 thus takes in the ARP request message of step S72, and enters the acquired information into its own network table 12a. More specifically, a new table entry for the correspondence between the IP address "A" and MAC address "a" is entered into the network table 12a. This entry is represented in the form
(c|a:A),
where the first term "c" means that the owner of the network table has a MAC address "c," and the second and third terms "a" and "A" separated by a colon represent a pair of MAC and IP addresses that have been received and registered. This convention of the network table entries is used throughout the present specification, although the accompanying drawings show them in more graphical form.
(S74) The ARP request message transmitted in step S72 now reaches the network adapter 21 and is passed to the host 20. The host 20 registers a new entry (b|a:A) in its own network table 20a; this entry associates the MAC address "a" with the IP address "A" of the host 10.
(S75) The host 20 recognizes that the ARP request message is destined for itself, and thus replies to the host 10 by returning an ARP response message via the network adapter 21.
(S76) The network adapter 11 receives the ARP response message transmitted from the host 20 in step S75. It then registers an entry (a|b:B) in its network table 11a, which shows the association between the MAC address "b" and IP address "B" of the host 20.
(S77) Now that the destination MAC address of the host 20 is available in the network table 11a, the network adapter 11 executes transmission of the data message that has been suspended since step S72. Here, the IP and MAC addresses are added to the message as follows.
"(b) (a) (B) (A) (DATA)"
Here, the first field "(b)" shows the destination MAC address, the second field "(a)" the source MAC address, the third field "(B)" the destination IP address, the fourth field "(A)" the source IP address, and the fifth field "(DATA)" the transmission data.
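Steps S71 through S77 can be sketched in Python as follows: the adapter queues a message whose destination MAC address is unknown, issues an ARP request, and sends the queued message once the response arrives. The `Adapter` class and its method names are hypothetical, and messages are modeled as plain tuples in the simplified field order used in the text.

```python
def frame(dst_mac, src_mac, message):
    """Prepend destination and source MAC addresses to an IP-level message."""
    return (dst_mac, src_mac) + message


class Adapter:
    def __init__(self, mac):
        self.mac = mac
        self.table = {}    # network table: IP address -> MAC address
        self.pending = []  # messages suspended until the address is resolved

    def send(self, message):
        """Return a framed message, or suspend it and return an ARP request."""
        dst_ip, src_ip = message[0], message[1]
        dst_mac = self.table.get(dst_ip)
        if dst_mac is None:
            self.pending.append(message)
            return ("BCAST", self.mac, dst_ip, src_ip)
        return frame(dst_mac, self.mac, message)

    def on_arp_response(self, mac, ip):
        """Register the resolved address and release the messages waiting for it."""
        self.table[ip] = mac
        ready = [m for m in self.pending if m[0] == ip]
        self.pending = [m for m in self.pending if m[0] != ip]
        return [frame(mac, self.mac, m) for m in ready]


adapter_11 = Adapter("a")
arp_request = adapter_11.send(("B", "A", "DATA"))  # S71, S72
frames = adapter_11.on_arp_response("b", "B")      # S76, S77
print(arp_request)  # ('BCAST', 'a', 'B', 'A')
print(frames)       # [('b', 'a', 'B', 'A', 'DATA')]
```

Once the entry for "B" is registered, later calls to `send` for the same destination return a framed message directly, without another ARP exchange.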
(S78) It is assumed here that some trouble has happened to the main network adapter 11.
(S79) The host 10 detects the failure of the network adapter 11, and initiates a process of switching from the main network adapter 11 to the backup network adapter 12.
(S80) The host 10 sends an activation order to the network adapter 12 that has been in standby mode, thereby enabling the activated network adapter 12 to provide the host 10 with communication functions, in place of the main network adapter 11.
(S81) To send another data message to the host 20, the host 10 requests its transmission to the network adapter 12.
(S82) The network adapter 12 transmits the message supplied by the host 10, adding address information as follows.
"(b) (c) (B) (A) (DATA)"
(S83) In response to the message sent from the host 10, the host 20 performs some processing and reports the result status to the host 10. Because neither the host 20 nor the network adapter 21 is aware that the network adapter 11 has lost its functionality, the old table entry (b|a:A) remains unchanged in the network table 20a in the host 20. The response message addressed to the host 10 therefore carries a destination MAC address of "a," which points to the failed network adapter 11:
"(a) (b) (A) (B) (DATA)"
(S84) However, this message cannot reach the destination host 10 because the network adapter 11 is not operational due to the trouble. The response message sent from the host 20 is eventually discarded.
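The failure in steps S83 and S84 can be reproduced with a small Python sketch. It assumes, purely for illustration, that a frame is delivered only if its destination MAC address belongs to a working adapter; the function names `build_response` and `deliver` are hypothetical.

```python
# Network table 20a of host 20 as left by step S74: IP "A" maps to MAC "a".
table_20a = {"A": "a"}

# MAC addresses of adapters currently able to receive frames.
# Adapter 11 ("a") has failed; adapter 12 ("c") now serves host 10.
operational_macs = {"b", "c"}


def build_response(table, dst_ip, src_ip, src_mac, payload):
    """Frame a response using whatever MAC address the sender's table holds."""
    return (table[dst_ip], src_mac, dst_ip, src_ip, payload)


def deliver(frame_, operational):
    """Return True if the destination MAC address belongs to a working adapter."""
    return frame_[0] in operational


response = build_response(table_20a, "A", "B", "b", "DATA")
print(response)                             # ('a', 'b', 'A', 'B', 'DATA')
print(deliver(response, operational_macs))  # False
```

The response is addressed to "a" even though the host 10 is now reachable only at "c," so it is lost.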
(S85) Such disruption of communication will continue until the relevant entry of the network table 20a in the host 20 is updated with a correct value. Unfortunately, however, the network table 20a will never be corrected unless ARP request/response messages pertaining to the host 10 are received.
On the other hand, the TCP/IP protocol standards require the network control software to nullify a network table entry for a particular network address that exhibits no activity for a predetermined timeout period, which is typically twenty minutes. This simply means that the host 20 cannot communicate with the host 10 for twenty minutes. The only thing the hosts can do in such a situation is to wait for the 20-minute timeout to expire. Upon timeout, the above-described mechanism is activated to nullify the invalid entries. The old entry holding the MAC address of the failed network adapter 11 is then removed from the network table 20a, and the hosts 10 and 20 become able to communicate with each other again.
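The timeout mechanism can be sketched in Python as follows. This is a simplified model, not an actual protocol stack: time is passed in explicitly as seconds, and it is assumed that nothing refreshes the entry in the meantime. The class name `ExpiringTable` is hypothetical.

```python
TIMEOUT_SECONDS = 20 * 60  # the twenty-minute figure cited in the text


class ExpiringTable:
    def __init__(self):
        self.entries = {}  # IP address -> (MAC address, time of last activity)

    def register(self, ip, mac, now):
        """Record a mapping together with the time it was learned."""
        self.entries[ip] = (mac, now)

    def lookup(self, ip, now):
        """Return the MAC address for ip, nullifying the entry if it has timed out."""
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, last_seen = entry
        if now - last_seen >= TIMEOUT_SECONDS:
            del self.entries[ip]
            return None
        return mac


table_20a = ExpiringTable()
table_20a.register("A", "a", now=0)
print(table_20a.lookup("A", now=19 * 60))  # a
print(table_20a.lookup("A", now=20 * 60))  # None
```

Until the timeout elapses, every lookup keeps returning the MAC address of the failed adapter, which is the window during which communication is disrupted.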
(S86) Suppose that the 20-minute timeout period has expired, and the host 10 has another data message to send to the host 20. The network adapter 12 receives a transmission request from the host 10.
(S87) At this stage, the entry concerning the MAC address of the network adapter 21 cannot be found in the network table 12a. Therefore, the network adapter 12 transmits the following ARP request message in broadcast mode to obtain the destination MAC address.
"(BCAST) (c) (B) (A)"
(S88) The destination network adapter 21 receives the ARP broadcast and passes it to the host 20, thus allowing the IP and MAC addresses of the host 10 to be registered into the network table 20a. More specifically, a new entry (b|c:A) showing the association between the IP address "A" and MAC address "c" is entered into the network table 20a.
(S89) The host 20 returns an ARP response message via the network adapter 21 to inform the host 10 of the MAC address "b" in question.
(S90) The network adapter 12 receives this ARP response message and registers an entry (c|b:B) in its network table 12a.
(S91) The network adapter 12 transmits the data message that has been suspended since step S87 by setting its destination MAC and IP addresses as follows.
"(b) (c) (B) (A) (DATA)"
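The recovery in steps S87 and S88 can be sketched in Python as follows: when the host 20 receives the ARP broadcast from the network adapter 12, it learns the new MAC address for the IP address "A." The function name `on_arp_request` is hypothetical, and the tuple layout follows the simplified notation of this specification.

```python
def on_arp_request(table, request):
    """Learn the sender's IP-to-MAC mapping from a simplified ARP request."""
    _bcast, src_mac, _dst_ip, src_ip = request
    table[src_ip] = src_mac


# Network table 20a after the old entry for "A" has been nullified by the timeout.
table_20a = {}
on_arp_request(table_20a, ("BCAST", "c", "B", "A"))  # S87, S88
print(table_20a)  # {'A': 'c'}

# A response from host 20 is now addressed to the backup adapter.
response = (table_20a["A"], "b", "A", "B", "DATA")
print(response)   # ('c', 'b', 'A', 'B', 'DATA')
```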
As described above, the conventional system cannot immediately remove the MAC address of the failed network adapter 11 once it is registered in the target host's network table 20a. Since this stale MAC address remains in the table until the 20-minute timeout of step S85 expires, the host 20 is unable to send any messages to the network adapter 12 during this period. Although the hot-standby configuration allows the host 10 to detect errors in the main network adapter 11 and switch the operations to the backup network adapter 12, the communication is disrupted because of the inability to update network tables in other hosts. Some applications that use the connection-oriented TCP protocol may produce timeout errors during the 20-minute absence of communication, resulting in an interruption of the ongoing application processes.
As such, the host 10 cannot enjoy the advantages of duplex network adapter configurations since they do not work effectively at all in the conventional system structure.