High-performance computing (HPC) systems include large, distributed systems with many computing nodes that communicate with each other to solve a shared computation. The connections between nodes are often formed from high-speed serial interconnects that transmit bits of data (i.e., ones and zeros) over parallel data lanes at a maximum speed, or bit rate. The long-term reliability of high-speed serial interconnects is being challenged as transmission rates increase. In particular, as bit rates increase, there is a corresponding increase in signal loss caused by the underlying physical media. This signal loss is managed by increasing circuit complexity, using higher-cost materials, and actively repeating the signal (or reducing the physical distance between nodes). All of these mitigation tools attempt to achieve a high Mean Time To False Packet Acceptance (MTTFPA) with maximum service time, or availability.
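The relationship between bit error rate and MTTFPA can be sketched with a rough back-of-the-envelope model. The function below is not from the source; it is a hypothetical, order-of-magnitude estimate that assumes independent bit errors and approximates the fraction of errored packets a CRC fails to detect as 2^-crc_bits. All parameter names and values are illustrative.

```python
def mttfpa_years(bit_rate_hz, ber, packet_bits, crc_bits):
    """Return a rough order-of-magnitude MTTFPA estimate in years.

    Hypothetical model: independent bit errors, and an undetected-error
    fraction of 2**-crc_bits for any packet containing errors.
    """
    packets_per_sec = bit_rate_hz / packet_bits
    # Probability that a packet contains at least one bit error.
    p_err = 1.0 - (1.0 - ber) ** packet_bits
    # Crude bound on the fraction of errored packets the CRC misses.
    p_undetected = p_err * 2.0 ** -crc_bits
    false_accepts_per_sec = packets_per_sec * p_undetected
    seconds_per_year = 365.25 * 24 * 3600
    return 1.0 / (false_accepts_per_sec * seconds_per_year)

# A degraded lane (higher raw BER) sharply reduces MTTFPA, which is why
# links monitor per-lane error rates and remove failing lanes.
good = mttfpa_years(25e9, 1e-15, 2048, 32)
bad = mttfpa_years(25e9, 1e-7, 2048, 32)
print(good > bad)  # prints True
```

The point of the sketch is only the scaling: because the undetected-error probability grows with the raw bit error rate, a lane whose error rate climbs by several orders of magnitude can collapse MTTFPA from millions of years to a fraction of one.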
Lane fail over is a serial link feature that removes one or more failing lanes from service when their error rate is at, or approaching, a level that results in unacceptable performance or MTTFPA. During many prior-art fail over procedures, all lanes are removed from service while the communications link re-initializes to a reduced width that avoids the failing lane(s). During this interval, all network traffic directed toward the failed-over communications link is re-routed (if alternate paths exist) or buffered. Both re-routing and buffering contribute to network congestion, reduced performance, and possibly even system failure.
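The prior-art behavior described above can be modeled with a toy simulation: when a lane fails, the entire link carries no traffic while it re-initializes at reduced width, and offered traffic accumulates in a buffer. This is an illustrative sketch, not the source's mechanism; the class, method names, and cycle counts are all hypothetical.

```python
from collections import deque

class Link:
    """Toy model of a multi-lane serial link with prior-art fail over."""

    def __init__(self, lanes=4, reinit_cycles=3):
        self.active_lanes = lanes
        self.reinit_cycles = reinit_cycles   # cycles the link is down per fail over
        self.reinit_remaining = 0
        self.buffer = deque()                # traffic held while the link is down
        self.delivered = 0

    def fail_lane(self):
        """A lane's error rate crossed its threshold: drop the lane and
        take the whole link out of service while it retrains at reduced width."""
        self.active_lanes -= 1
        self.reinit_remaining = self.reinit_cycles

    def step(self, offered_packets):
        """Advance one cycle; return the number of packets delivered."""
        self.buffer.extend(range(offered_packets))
        if self.reinit_remaining > 0:
            # No traffic flows during re-initialization; everything buffers,
            # which is the congestion source the text describes.
            self.reinit_remaining -= 1
            return 0
        # Throughput scales with the number of surviving lanes.
        sent = min(len(self.buffer), self.active_lanes)
        for _ in range(sent):
            self.buffer.popleft()
        self.delivered += sent
        return sent
```

Running the model shows the cost: a single lane failure stalls all four lanes for the re-initialization interval, the buffer grows each stalled cycle, and afterward the backlog drains at only three lanes' worth of bandwidth.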