The present invention relates to the field of computer system networks. In particular, the present invention pertains to a software-based module for augmenting a server computer system to perform network interface card fault tolerance and fail over.
Computer systems linked to each other in a network are commonly used in businesses and other organizations. Computer system networks (“networks”) provide a number of benefits for the user, such as increased productivity, flexibility, and convenience as well as resource sharing and allocation.
Networks are configured in different ways depending on implementation-specific details such as the hardware used and the physical location of the equipment, and also depending on the particular objectives of the network. In general, networks include one or more server computer systems, each communicatively coupled to numerous client computer systems.
In contemporary networks, server computer systems are typically coupled to the network using more than one network interface card (NIC). Multiple NICs increase the total available bandwidth capacity for transmitting and receiving data packets. Multiple NICs also provide resiliency and redundancy if one of the NICs fails. In the case of a failure of a NIC, one of the other NICs is used to handle the traffic previously handled by the failed NIC, thereby increasing overall system reliability. Therefore, the client computer systems in communication with the server computer system through a particular NIC are not cut off from the server should that NIC fail. Accordingly, it is necessary to be able to detect when a NIC fails and, when a failed NIC is detected, to switch to a functioning NIC (this is referred to as fault tolerance and fail over support) as quickly as possible in order to minimize the time until a communication link is re-established between the server computer system and the client computer systems.
Prior Art FIG. 1 is an illustration of exemplary network 50 including two virtual local area networks (VLANs). In network 50, client computer system 140 (e.g., a workstation) is in one VLAN, and client computer systems 141, 142 and 143 are in a second VLAN. Both VLANs are serviced by server computer system 160. A data packet sent by server computer system 160 contains address information that is used to identify the particular client computer system(s) to which the data packet is to be sent. In addition, the data packet is tagged with a VLAN identifier that identifies the destination VLAN. The methods for addressing a data packet in a network comprising multiple VLANs are well known in the art; one method is defined by the IEEE 802.1Q standard.
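As an illustration of the VLAN tagging described above, the following minimal sketch builds and reads the four-byte tag defined by IEEE 802.1Q: a 16-bit tag protocol identifier (0x8100) followed by a 3-bit priority, a 1-bit drop-eligible indicator, and a 12-bit VLAN identifier, placed after the source address of an Ethernet frame. The function names (`build_vlan_tag`, `tag_frame`, `read_vlan_id`) are hypothetical and are not taken from any particular implementation.

```python
import struct
from typing import Optional

TPID_8021Q = 0x8100  # tag protocol identifier defined by IEEE 802.1Q


def build_vlan_tag(vlan_id: int, priority: int = 0, dei: int = 0) -> bytes:
    """Return the 4-byte 802.1Q tag for the given VLAN identifier."""
    if not 0 <= vlan_id <= 0x0FFF:
        raise ValueError("VLAN identifier must fit in 12 bits")
    if not 0 <= priority <= 7:
        raise ValueError("priority must fit in 3 bits")
    if dei not in (0, 1):
        raise ValueError("drop-eligible indicator must be 0 or 1")
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


def tag_frame(dst_mac: bytes, src_mac: bytes, vlan_id: int,
              ethertype: int, payload: bytes) -> bytes:
    """Assemble a tagged frame: addresses, VLAN tag, EtherType, payload."""
    return (dst_mac + src_mac + build_vlan_tag(vlan_id)
            + struct.pack("!H", ethertype) + payload)


def read_vlan_id(frame: bytes) -> Optional[int]:
    """Return the VLAN identifier of a tagged frame, or None if untagged."""
    if len(frame) < 16:
        return None
    tpid, tci = struct.unpack("!HH", frame[12:16])
    return tci & 0x0FFF if tpid == TPID_8021Q else None
```

A switch such as switch 150 or 151 would perform a lookup comparable to `read_vlan_id` before deciding where to direct the data packet.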
Switches 150 and 151 are able to read the VLAN identifier and the other address information contained in the data packet and direct the data packet accordingly. Thus, switch 150 reads the VLAN identifier and will direct the data packet to client computer system 140 if appropriate. Otherwise, the data packet proceeds to switch 151, which directs the data packet to the proper client computer system (e.g., client computer systems 141, 142 or 143) depending on the address information contained in the data packet.
One prior art technique for fault tolerance and fail over support utilizes a switch-dependent protocol implemented using server computer system 160 and switches 150 and 151. This prior art technique also requires NICs that are specifically designed for compatibility with switches 150 and 151 and the protocol being used. This prior art technique is problematic because it requires the use of a specific type of hardware (e.g., a specific type of NIC compatible with a specific type of switch). Thus, this prior art technique is not suitable for different types of hardware (e.g., NICs and switches). In particular, the prior art is not suitable for legacy hardware already present in a network.
Another drawback to this type of prior art technique is that the switch must be designed with the capability to implement the fault tolerance and fail over schemes. Thus, the complexity and the cost of the switch are substantially increased. Even so, the capabilities of the switch are relatively limited, and so the schemes for providing fault tolerance and fail over support are also limited. In addition, the cost of implementing this type of prior art technique is increased by the need to replace or upgrade legacy devices.
Prior art techniques for fault tolerance and fail over support are also not capable of detecting a partial failure of a NIC; that is, for example, they are not capable of detecting a NIC failure if the NIC is not able to properly transmit but continues to receive. These prior art techniques rely on the NIC to notify the server computer system protocols that the NIC is not functioning. However, consider the case in which a particular NIC is receiving and transmitting, but the outgoing data packets are not being received at their destination because of a failure that solely affects that NIC, such as, for example, a loose cable. In this case, the NIC is not aware that data packets it is transmitting are not reaching their destination. Because the NIC continues to receive properly, the NIC believes it is properly performing both of its send and receive functions. The NIC therefore does not notify the server computer system protocols that it is not functioning.
Thus, a disadvantage to the prior art is that some NIC failures are not detected, in particular partial failures, and so the overall reliability and performance of the server computer system and the network are reduced. This disadvantage is compounded because those who are responsible for maintaining the network will also not be aware of the failure and so cannot implement a fix; thus, the server computer system and network can continue to operate indefinitely with reduced reliability and performance. Furthermore, if the failure is detected, it may be necessary to test all of the NICs in order to isolate the failed NIC.
Accordingly, a need exists for a system and method that implement fault tolerance and fail over support wherein the system and method are not limited by the capabilities of a switch. A need also exists for a system and method that satisfy the above need, are switch-independent, and can be used with legacy hardware (e.g., switches and NICs). In addition, a need exists for a system and method that satisfy the above needs, can detect partial NIC failures (such as the inability of the NIC to either send or receive), and can identify which NIC has failed. Furthermore, a need exists for a system and method that satisfy the above needs and quickly accomplish fail over to a functioning NIC in order to minimize the time during which the communication link between the server computer system and client computer systems is not available.
The present invention provides a system and method that implement fault tolerance and fail over support wherein the system and method are not limited by the capabilities of a switch. The present invention also provides a system and method that satisfy the above need, are switch-independent, and can be used with legacy hardware (e.g., switches and NICs). Furthermore, the present invention provides a system and method that satisfy the above needs, can detect partial NIC failures (such as the inability of the NIC to either send or receive), and can identify which NIC has failed. In addition, the present invention provides a system and method that satisfy the above needs and quickly accomplish fail over to a functioning NIC in order to minimize the time during which the communication link between the server computer system and client computer systems is not available.
Specifically, in one embodiment, the present invention pertains to a method for detecting a non-functioning network interface card (NIC) in a server computer system adapted to have a plurality of network interface cards coupled thereto and communicatively coupled to client computer systems in a network. A directed packet is sent from a first NIC to a second NIC, and a directed packet is also sent from the second NIC to the first NIC. The server computer system uses a fault tolerance and fail over support scheme to monitor the NICs to determine whether the directed packet from the first NIC is received by the second NIC. The server computer system also monitors the first NIC to determine whether the directed packet from the second NIC is received by the first NIC. The server computer system determines whether the first NIC is functioning using the results from the monitoring.
In one embodiment, the first NIC sends a directed packet to a first plurality of NICs and a second plurality of NICs each send a directed packet to the first NIC. The server computer system monitors the first plurality of NICs to determine whether they receive the first directed packet, and the server computer system also monitors the first NIC to determine whether it receives each directed packet sent from the second plurality of NICs.
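The monitoring described above can be sketched as follows. The sketch assumes that the server computer system has recorded, for one monitoring interval, the set of (sender, receiver) pairs whose directed packets arrived. The decision rule is an assumption made for illustration and is not specified in this description: a NIC is treated as unable to transmit when none of its directed packets reached any peer, and as unable to receive when it received no directed packet from any peer. The function name `diagnose_nics` and the status strings are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def diagnose_nics(nics: Iterable[str],
                  delivered: Set[Tuple[str, str]]) -> Dict[str, str]:
    """Classify each NIC from the directed packets that were delivered.

    `delivered` holds (sender, receiver) pairs for packets that arrived.
    The classification rule is an illustrative assumption.
    """
    names = list(nics)
    status: Dict[str, str] = {}
    for nic in names:
        peers = [p for p in names if p != nic]
        # Did at least one peer receive a directed packet from this NIC?
        sent_ok = any((nic, peer) in delivered for peer in peers)
        # Did this NIC receive a directed packet from at least one peer?
        received_ok = any((peer, nic) in delivered for peer in peers)
        if sent_ok and received_ok:
            status[nic] = "functioning"
        elif received_ok:
            status[nic] = "transmit failure"
        elif sent_ok:
            status[nic] = "receive failure"
        else:
            status[nic] = "failed"
    return status


# Three NICs; nic1 still receives, but nothing it sends arrives.
observed = {("nic2", "nic1"), ("nic3", "nic1"),
            ("nic2", "nic3"), ("nic3", "nic2")}
print(diagnose_nics(["nic1", "nic2", "nic3"], observed))
```

Under this rule at least three NICs are needed to isolate a partial failure. With only two NICs, a single lost directed packet is consistent with either a transmit failure on the sender or a receive failure on the receiver, so both would be flagged.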
In one embodiment, when the first NIC is determined to be non-functioning, the functions of the first NIC are automatically switched from the first NIC to another one of the plurality of NICs. A broadcast packet is sent from the server computer system to the client computer systems. The broadcast packet contains a media access control (MAC) address for the NIC that replaces the first NIC, and each client computer system replaces the MAC address for the first NIC with the MAC address for the replacement NIC in its memory cache.
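The fail over sequence above can be sketched as follows. The sketch models each client's memory cache as a mapping from the server's network address to a MAC address; that structure, the dictionary-based packet, and the names `choose_replacement`, `build_update_broadcast`, and `apply_update` are assumptions introduced for illustration. The addresses are example values only.

```python
from typing import Dict, Optional

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"  # Ethernet broadcast address


def choose_replacement(nic_ok: Dict[str, bool],
                       failed_mac: str) -> Optional[str]:
    """Return the MAC address of the first functioning NIC other than the failed one."""
    for mac, ok in nic_ok.items():
        if ok and mac != failed_mac:
            return mac
    return None


def build_update_broadcast(server_addr: str, new_mac: str) -> Dict[str, str]:
    """Build a broadcast packet announcing the replacement NIC's MAC address."""
    return {"dst": BROADCAST_MAC, "server_addr": server_addr,
            "server_mac": new_mac}


def apply_update(cache: Dict[str, str], packet: Dict[str, str]) -> None:
    """Client side: overwrite the cached MAC address for the server."""
    cache[packet["server_addr"]] = packet["server_mac"]


# The first NIC has failed; the second NIC takes over its functions.
team = {"00:a0:c9:00:00:01": False, "00:a0:c9:00:00:02": True}
client_cache = {"10.0.0.1": "00:a0:c9:00:00:01"}

replacement = choose_replacement(team, "00:a0:c9:00:00:01")
if replacement is not None:
    apply_update(client_cache, build_update_broadcast("10.0.0.1", replacement))
print(client_cache)
```

Because the update is broadcast, every client computer system on the segment can refresh its cache from a single packet rather than waiting for its cached entry to expire.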
In one embodiment, an indication is provided to the server computer system when the directed packet from the first NIC is not received by the second NIC or when the directed packet sent by the second NIC is not received by the first NIC. In this embodiment, the indication is a cable disconnect message or a link lost status message.
These and other objects and advantages of the present invention will become obvious to those of ordinary skill in the art after having read the following detailed description of the preferred embodiments which are illustrated in the various drawing figures.