Communication in a computer network involves the exchange of data between two or more entities interconnected by communication links. These entities are typically software programs executing on computer platforms, such as end nodes and intermediate network nodes. An intermediate network node may be, for example, a router or switch that interconnects the communication links to enable transmission of data between the end nodes, such as servers having processor, memory and input/output (I/O) storage resources.
Communication software executing on the end nodes correlates and manages data communication with other end nodes. The nodes typically communicate by exchanging discrete frames or packets of data according to predefined protocols. In this context, a protocol consists of a set of rules defining how the nodes interact with each other. In addition, network software executing on the intermediate nodes allows expansion of communication to other end nodes. Collectively, these hardware and software components comprise a communications network, and their interconnections are defined by an underlying architecture.
The InfiniBand Architecture (IBA) is an I/O specification that defines a point-to-point, “switched fabric” technology used to, among other things, increase the aggregate data rate between processor and storage resources of a server. The IBA is described in the InfiniBand™ Architecture Specification Volume 1, Release 1.0.a, by InfiniBand Trade Association, Jun. 19, 2001, which specification is hereby incorporated by reference as though fully set forth herein. Broadly stated, the switched fabric technology may be embodied in a network switch configured to receive data traffic (IBA packets) from one or more input ports and forward that traffic over one or more output ports to an IBA communications network. A switch fabric of the network switch may interconnect a plurality of modules having input (ingress) and output (egress) ports that provide, e.g., Fibre Channel or Gigabit Ethernet link connections to the network.
Some network switches include fault tolerant features that enable single error (fault) detection and correction. These switches are typically fully redundant such that there is no single point of failure. A failure is defined as an unpredictable event that arises in the switch. Congestion that leads to dropping of packets is not typically considered a failure, because the architecture of the switch may account for it. Higher-level protocols executing on the switch or in various parts of the network may take a long time to respond to the failures they detect. This latency may result in increased traffic loss and congestion, along with other problems. The present invention is directed, in part, to detecting failures or errors in the switch as soon as possible.
In a fully redundant network switch system, any single fault disables only the module on which the fault occurs. Other modules in the switch may experience a loss of performance, but not of function. Although the redundant network switch is single-fault tolerant, multiple simultaneous faults can still “cripple” the switch. To maintain a fault tolerant system, any single fault must be detected and repaired as soon as possible to avoid a multiple fault situation. The present invention is further directed to providing an assist that detects when there may be an actual error (fault) in the network switch so that the fault can be corrected, thereby reducing the possibility of multiple faults occurring at substantially the same time.
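The difference between a single fault and multiple simultaneous faults may be illustrated with a minimal sketch. The model below is an assumption made only for illustration and is not taken from the switch described herein: each function of the switch is served by a group of redundant modules, the switch remains functional while every group has at least one healthy module, and performance is approximated as the fraction of healthy modules. The names `RedundantSwitch`, `inject_fault`, `repair`, `is_functional` and `capacity` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RedundantSwitch:
    """Toy model of a fully redundant switch (illustrative assumption only)."""

    # Hypothetical layout: function name -> names of its redundant modules.
    groups: Dict[str, List[str]]
    faulted: Set[str] = field(default_factory=set)

    def _all_modules(self) -> List[str]:
        return [m for mods in self.groups.values() for m in mods]

    def inject_fault(self, module: str) -> None:
        # A fault disables only the module on which it occurs.
        if module not in self._all_modules():
            raise ValueError(f"unknown module: {module}")
        self.faulted.add(module)

    def repair(self, module: str) -> None:
        # Repairing a module clears its fault, if any.
        self.faulted.discard(module)

    def is_functional(self) -> bool:
        # Functional while every group still has at least one healthy module.
        return all(
            any(m not in self.faulted for m in mods)
            for mods in self.groups.values()
        )

    def capacity(self) -> float:
        # Performance approximated as the fraction of healthy modules.
        modules = self._all_modules()
        healthy = sum(1 for m in modules if m not in self.faulted)
        return healthy / len(modules)


# Hypothetical two-group layout used only for demonstration.
switch = RedundantSwitch(
    {"fabric": ["fabric_a", "fabric_b"], "line": ["line_a", "line_b"]}
)
switch.inject_fault("fabric_a")
print(switch.is_functional(), switch.capacity())  # single fault
switch.inject_fault("fabric_b")
print(switch.is_functional(), switch.capacity())  # second fault in same group
```

In this sketch, calling `repair("fabric_a")` before the second fault is injected keeps the "fabric" group served, which is why prompt detection and repair of a single fault matters in such a model.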