1. Field of the Invention
Embodiments of the present invention relate generally to the field of computer networking and more specifically to a technique for identifying a failed network interface card within a team of network interface cards.
2. Description of the Related Art
Modern computing devices may have one or more network interface cards (NICs) in a single system. A plurality of NICs allows the computing device to increase its communication bandwidth beyond what a single NIC could provide, and such a group is commonly referred to as a “team” of NICs. Typically, the NICs in a team share a common Internet Protocol (IP) address, while each NIC may or may not retain a unique Media Access Control (MAC) address. One aspect of using this team configuration is that network traffic between the computing device and other computing devices in the network may be distributed among the NICs in the team such that the overall throughput of the team may be maximized. This type of operation is referred to as “load balancing.” Another aspect of using a team configuration is that traffic may be migrated from a nonfunctional or unreliable NIC within the team to a functional or more reliable NIC within the team. This type of operation is referred to as “failover.” Both operations require the capability to identify communication faults in the computer network on an ongoing basis.
In a networked computing environment, any component (the switch, the networking cables, or the NICs) in the network may become faulty, leading to poor network reliability. The difficulty of diagnosing network faults is exacerbated by the possibility that a NIC may experience a partial failure, in that it may be able to receive data without having the capability to transmit data (or the reverse). Finally, it is possible for a NIC to transmit and receive data, but for that data to be exchanged with a substantially higher error rate than is desired. The higher error rate may lead to substantial retransmissions of data and an unacceptable increase in overall network traffic.
One method of identifying a faulty NIC within a team of NICs is to transmit “keep-alive” packets between the NICs to verify that data is being received and transmitted properly between the various NICs. These keep-alive packets are additional packets generated exclusively for the purpose of verifying network connectivity (the simultaneous capability to receive and transmit data from each NIC). Typically, the NIC device driver in the operating system generates and manages the keep-alive packets.
In a computing device containing a team of two NICs, a common method for monitoring the reliability of the two NICs is to transmit a first keep-alive packet from the first NIC to the second NIC and then to transmit a second keep-alive packet from the second NIC to the first NIC. If both keep-alive packets are successfully received, the transmission and reception capabilities of both NICs are confirmed for the current round of testing, called a “keep-alive cycle.” On the other hand, if one or both packets are not received, then a problem clearly exists with at least one of the NICs or with their interconnection network (the cable(s) and/or the switch(es)).
Although this approach may be used to identify situations where a NIC within the team has failed (for transmitting or receiving or both), one disadvantage is that when there are only two NICs in a team, the technique does not identify which specific NIC has failed. Without knowing which NIC is faulty, the computing device cannot fail over the existing communications to the fully functional NIC.
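The two-NIC ambiguity can be illustrated with a minimal sketch. The `Nic` class and the `delivered` and `two_nic_cycle` helpers are hypothetical names introduced only for this illustration. The model assumes that a keep-alive packet arrives if and only if the sender can transmit and the receiver can receive, and it ignores cable and switch faults.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Nic:
    """Hypothetical model of a NIC with independent transmit/receive health."""
    name: str
    can_tx: bool = True
    can_rx: bool = True


def delivered(sender: Nic, receiver: Nic) -> bool:
    """Assumed rule: a keep-alive arrives only if the sender can transmit
    and the receiver can receive."""
    return sender.can_tx and receiver.can_rx


def two_nic_cycle(first: Nic, second: Nic) -> Tuple[bool, bool]:
    """Run one keep-alive cycle: first -> second, then second -> first.

    Returns (first_to_second_ok, second_to_first_ok).
    """
    return delivered(first, second), delivered(second, first)


# Case 1: the first NIC cannot transmit.
case_tx_fault = two_nic_cycle(Nic("nic1", can_tx=False), Nic("nic2"))

# Case 2: the second NIC cannot receive.
case_rx_fault = two_nic_cycle(Nic("nic1"), Nic("nic2", can_rx=False))

# Both faults produce the same observation, so the cycle alone cannot
# tell which of the two NICs is the faulty one.
print(case_tx_fault, case_rx_fault)
```

In this model, both fault cases lose the same keep-alive packet (first NIC to second NIC), which is why the result of the cycle does not point to a specific NIC.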
In a computing device containing a team of three or more NICs, a common method for monitoring the reliability of the NICs is for each NIC to transmit a keep-alive packet to every other NIC in the team. For example, in a three-NIC team, the first NIC would first transmit a keep-alive packet to both the second NIC and the third NIC. Then, the second NIC would transmit a keep-alive packet to the third NIC and the first NIC. Finally, the third NIC would transmit a keep-alive packet to the first NIC and the second NIC. In the event one NIC of the three NICs in the team fails (i.e., has a transmission or reception problem), the failed NIC is easily identifiable since the remaining NICs are able to transmit and receive the keep-alive packets.
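The following sketch illustrates how the every-NIC-to-every-other-NIC exchange can single out one failed NIC. It is an illustration under stated assumptions rather than the behavior of any particular driver: `full_mesh_cycle` and `suspects` are hypothetical helpers, at most one NIC is assumed to be faulty, and a packet is assumed to arrive if and only if the sender can transmit and the receiver can receive.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Hypothetical health model: NIC name -> (can_transmit, can_receive)
Health = Dict[str, Tuple[bool, bool]]
Results = Dict[Tuple[str, str], bool]


def full_mesh_cycle(team: Health) -> Results:
    """Every NIC sends one keep-alive to every other NIC.

    Returns a map of (sender, receiver) -> whether the packet arrived.
    """
    return {
        (src, dst): team[src][0] and team[dst][1]
        for src, dst in permutations(team, 2)
    }


def suspects(results: Results) -> List[str]:
    """Return the NICs that took part in every lost keep-alive.

    Under the single-fault assumption, a one-element result names the
    failed NIC; a longer result means the cycle was ambiguous.
    """
    lost = [pair for pair, ok in results.items() if not ok]
    if not lost:
        return []
    common = set(lost[0])
    for pair in lost[1:]:
        common &= set(pair)
    return sorted(common)


# Three-NIC team in which nic2 has lost the ability to transmit.
team = {
    "nic1": (True, True),
    "nic2": (False, True),
    "nic3": (True, True),
}
print(suspects(full_mesh_cycle(team)))
```

With three NICs, the lost packets all share the same sender or receiver, so the intersection narrows to one name. Running the same helpers on a two-NIC team returns both names, which matches the limitation of the two-NIC method.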
One disadvantage of this approach is that each NIC in the team transmits keep-alive packets to every other NIC in the team, generating substantial network traffic, a problem which becomes especially acute when a team has a large number of members. This problem is exacerbated when keep-alive packets are sent frequently.
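The growth in keep-alive traffic can be made concrete with a short calculation. The sketch assumes that each keep-alive is a separate packet addressed to one other NIC, so a team of n NICs sends n(n - 1) packets per keep-alive cycle. The function names and the cycle interval parameter are hypothetical and introduced only for illustration.

```python
def keepalives_per_cycle(team_size: int) -> int:
    """Packets sent in one cycle when every NIC sends one keep-alive
    to every other NIC (assumes one packet per destination)."""
    return team_size * (team_size - 1)


def keepalives_per_second(team_size: int, cycle_interval_s: float) -> float:
    """Average keep-alive packet rate for a given cycle interval in seconds."""
    return keepalives_per_cycle(team_size) / cycle_interval_s


# Packet count per cycle grows quadratically with team size.
for size in (2, 3, 4, 8, 16):
    print(size, keepalives_per_cycle(size))
```

For the three-NIC example, this gives six packets per cycle (each of the three NICs sends to the two others). Shortening the cycle interval multiplies the resulting packet rate accordingly.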
As the foregoing illustrates, what is needed in the art is a more efficient technique for identifying a failed NIC within a team of NICs.