High performance computing (HPC) systems include large, distributed systems having many computing nodes that communicate with each other to solve a shared computation. The connections between nodes are often formed from high speed serial interconnects that transmit bits of data (i.e., ones and zeroes) in parallel data lanes at a maximum speed, or bit rate. The long term reliability of high speed serial interconnects is being challenged as transmission rates increase. In particular, as bit rates increase, there is a corresponding increase in signal loss caused by the underlying physical media. This signal loss is managed by increasing circuit complexity, using higher cost materials, and actively repeating the signal (or reducing the physical distance between nodes). All of these mitigation tools attempt to achieve high Mean Time To False Packet Acceptance (MTTFPA), with maximum service time or availability.
One cause of signal loss is the physical nature of digital transmission media. Bits of data are transmitted on a physical medium by varying a voltage. For example, +1V may represent a value of 1, while −1V may represent a value of 0; different systems will use different voltages, but the concept is the same. The voltages are sampled by the receiving circuitry at fixed intervals, according to the bit rate, to determine the bit values. The physical material forming the channel that conveys these voltages from one point to another requires a small time to transition from one voltage to the other. Due to this voltage transition time, each transmitted bit “smears” its voltage value into a time slot allocated to the next bit. If the voltage transition frequency (that is, the reciprocal of the transition time) represents even a modest fraction of the bit rate, this smearing may cause the sampled voltage to reach less than its optimum value at the sampling time, and in fact may cause the sampled voltage to indicate an incorrect bit value. This is common when, for example, a 1-bit value (e.g., +1V) follows a long sequence of 0-bits that have given the physical media sufficient time to strongly adjust to the 0-bit voltage (−1V).
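As a rough illustration of this smearing, the following Python sketch models the channel as a simple first-order (exponential) response. The function name, the time constant, the idle starting voltage of 0V, and the choice to sample at the end of each bit slot are all assumptions made for illustration; they are not taken from any particular system.

```python
import math


def sampled_voltages(bits, bit_period=1.0, time_constant=1.0,
                     v_high=1.0, v_low=-1.0):
    """Return the voltage sampled at the end of each bit slot.

    Assumed model: within each slot the line relaxes exponentially from
    its current voltage toward the driven level (v_high for a 1-bit,
    v_low for a 0-bit).
    """
    # Fraction of the remaining voltage gap still left after one bit slot.
    decay = math.exp(-bit_period / time_constant)
    voltage = 0.0  # assumed idle level before the first bit
    samples = []
    for bit in bits:
        target = v_high if bit else v_low
        voltage = target + (voltage - target) * decay
        samples.append(voltage)
    return samples


# A 1-bit following a long run of 0-bits.
pattern = [0] * 8 + [1]
fast_channel = sampled_voltages(pattern, time_constant=1.0)
slow_channel = sampled_voltages(pattern, time_constant=2.0)
print(round(fast_channel[-1], 2), round(slow_channel[-1], 2))
```

Under these assumptions, the final 1-bit is sampled at about +0.26V on the faster channel, well short of +1V, and at about −0.20V on the slower channel, where it would be read as a 0-bit.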
This situation is improved by the use of pseudo-random bit patterns for transmitting bits. Data for transmission often are run through various encoding and decoding algorithms to provide a roughly even mix of transmitted 0-bits and 1-bits, which provides various electrical and data protocol advantages by reducing the probability of long 0-bit or 1-bit sequences. However, it is still possible to generate long runs of zeros or ones, occasionally causing strong voltage “smear” and incorrect bit decisions at the optimum sampling time as described above.
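The effect of such encoding on run lengths can be sketched with an additive scrambler. The example below is only an illustration: the text above does not name a specific algorithm, so the 7-bit linear feedback shift register (the common PRBS7 polynomial x^7 + x^6 + 1), the seed, and the function names are assumptions.

```python
def prbs7(seed=0x7F):
    """Yield a pseudo-random bit stream from a 7-bit LFSR (x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        # Feedback is the XOR of the two oldest bits in the register.
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def scramble(bits, seed=0x7F):
    """XOR the data with the LFSR stream; applying it twice restores the data."""
    keystream = prbs7(seed)
    return [bit ^ next(keystream) for bit in bits]


def longest_run(bits):
    """Length of the longest run of identical consecutive bits."""
    longest = current = 0
    previous = None
    for bit in bits:
        current = current + 1 if bit == previous else 1
        previous = bit
        longest = max(longest, current)
    return longest


data = [0] * 64                 # worst case: a long run of 0-bits
line_bits = scramble(data)
print(longest_run(data), longest_run(line_bits))
```

In this sketch, a run of 64 zeros is broken into runs of at most 7 identical bits. The remaining risk is also visible: if the data happens to equal the scrambler stream itself, as `line_bits` does here, scrambling it produces a long run of zeros again.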
For these reasons, various algorithms are employed to increase the gap between the minimum sampled 1-bit voltage and the maximum sampled 0-bit voltage. These gaps are often visualized using plots of the voltage waveforms overlapping over time. These plots are called “eye diagrams” because the shape of the overlapping waveforms resembles the appearance of a human eye. Algorithms for increasing the distance between sampled voltages “open the eyes” in the diagram, and improve bit recognition rates by receivers.
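The gap described above can be computed directly from a set of sampled voltages. The following sketch assumes the transmitted bits are known (as in a test pattern); the function name and the example voltages are hypothetical.

```python
def eye_opening(bits, samples):
    """Return the minimum sampled 1-bit voltage minus the maximum sampled 0-bit voltage.

    A positive result means the eye is open at the sampling instant;
    a result at or below zero means the two groups of samples overlap.
    """
    ones = [v for b, v in zip(bits, samples) if b == 1]
    zeros = [v for b, v in zip(bits, samples) if b == 0]
    if not ones or not zeros:
        raise ValueError("need at least one 1-bit and one 0-bit sample")
    return min(ones) - max(zeros)


bits = [1, 0, 1, 0]
open_eye = eye_opening(bits, [0.9, -0.8, 0.3, -1.0])    # weakest 1 is still above every 0
closed_eye = eye_opening(bits, [0.9, 0.1, -0.2, -1.0])  # a 1 sampled below a 0
print(open_eye > 0, closed_eye > 0)
```

An equalization algorithm that "opens the eye" would, in these terms, increase the value returned by `eye_opening` for the same transmitted pattern.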
One of the common algorithms for increasing the voltage gap is decision feedback equalization (DFE). This algorithm works by sampling the “smear” caused by receipt of each bit (or a number of previous bits) and adding a multiple of that smear to the voltage of the next received signal, before the signal is sampled to determine the bit. By adding this correction signal, the channel dynamically increases the 1-bit voltage and decreases the 0-bit voltage for each sampled bit by using feedback from the previous bit(s), thereby “equalizing” the channel voltages. The smeared bits that are sampled for this purpose are called “tapped”, and thus there are single-tap and multi-tap DFEs.
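A single-tap version of this feedback loop can be sketched as follows. This is a minimal illustration rather than a description of any particular receiver: the function name, the ±1 symbol convention, the tap value of 0.1, and the example voltages are assumptions.

```python
def dfe_decide(received, tap=0.1, initial_symbol=-1):
    """Decide bits from sampled voltages using a single-tap DFE.

    Before each decision, the estimated smear of the previously decided
    symbol (tap * previous symbol, with symbols of +1 or -1) is removed
    from the incoming voltage.
    """
    previous = initial_symbol  # assumed symbol before the first sample
    bits = []
    for voltage in received:
        corrected = voltage - tap * previous
        symbol = 1 if corrected >= 0 else -1
        bits.append(1 if symbol == 1 else 0)
        previous = symbol
    return bits


# Sent bits 0, 0, 1, 1; the first 1-bit arrives smeared down to -0.05.
received = [-1.0, -1.0, -0.05, 0.9]
print(dfe_decide(received, tap=0.0))  # plain threshold, no feedback
print(dfe_decide(received, tap=0.1))  # single-tap feedback
```

With the tap set to zero, the smeared third sample is read as a 0-bit; with the feedback enabled, the preceding 0-bit decision lifts the corrected voltage to about +0.05 and the bit is read as a 1. A multi-tap variant would keep several previous decisions, each with its own multiplier.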
The DFE algorithm can cause burst errors, as now described. A “burst error” can occur in any algorithm for which the determination of a given bit value as a 0 or a 1 influences the determination of the next bit value, so that an error has a chance to propagate to one or more subsequent bits, thereby producing a localized “burst” of errors. In the case of DFE, the addition of smear from an incorrectly detected bit value will move the channel voltage in the wrong direction when the next bit is sampled, increasing the probability that the next bit will also be incorrectly detected. The DFE smear multiplier is typically small, e.g., one tenth of the transmission voltage in an exemplary system, so the system will eventually correct itself after detecting a correct bit (or, in a multi-tap system, a series of correct bits). However, correction will occur only after the receiver has been impacted by the burst error.
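The propagation mechanism can be made concrete by comparing a single-tap DFE that feeds back its own decisions with a hypothetical one that is always fed the true previous bit. The function name, the `ideal_feedback` switch, and the example voltages below are assumptions chosen so that one noise-induced error lands next to a marginal sample.

```python
def error_positions(received, sent_bits, tap=0.1, ideal_feedback=False):
    """Return the indices at which a single-tap DFE decides the wrong bit.

    With ideal_feedback=True the true previous bit is fed back instead of
    the receiver's own decision, which is impossible in a real receiver
    but isolates the errors caused purely by wrong feedback.
    """
    previous = -1  # assumed symbol before the first sample (a 0-bit)
    errors = []
    for index, (voltage, sent) in enumerate(zip(received, sent_bits)):
        corrected = voltage - tap * previous
        decided = 1 if corrected >= 0 else 0
        if decided != sent:
            errors.append(index)
        fed_back = sent if ideal_feedback else decided
        previous = 1 if fed_back == 1 else -1
    return errors


sent = [0, 1, 0, 0]
# Index 1 is hit by noise (a 1-bit received at -0.2);
# index 2 is a marginal 0-bit smeared up to -0.05.
received = [-1.0, -0.2, -0.05, -0.9]
print(error_positions(received, sent))
print(error_positions(received, sent, ideal_feedback=True))
```

In this constructed case the noise hit alone causes one error, at index 1. With real decision feedback, the wrong decision pushes the next sample the wrong way and index 2 is also misread, giving a burst of two errors before the loop recovers at index 3.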
Various means are known to reduce the effect of burst errors on receivers. Channel architecture and data coding can help minimize burst errors. Specifically, data striping and bit transmission order impact error detection and correction capability in the presence of burst errors. Recently, some standards organizations have developed standards (IEEE 802.3bj and IFBTA) that incorporate Forward Error Correction (FEC) into their specifications to address reliability issues. FEC involves the addition of check bits into the data stream, and thereby permits correction of faulty bits, but at the cost of significant added latency. Moreover, this latency accumulates over each hop between nodes in an HPC system. At a bit rate of 25 gigabits per second (Gb/s), the latency increase for an exemplary standards-based Reed-Solomon FEC code is greater than 50 nanoseconds (ns) per hop. This increase is unacceptable for a latency sensitive protocol in an HPC system, and in particular for one where the average uncorrected per-hop latency is well under 100 ns, because it dramatically reduces system performance. The status quo of data transmission that lacks error correction and processes the resultant uncorrected burst error bits is also unacceptable, because it can lead to system crashes.
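The cumulative cost can be worked through with simple arithmetic. The sketch below uses the two figures given above as bounds (100 ns as an upper bound on uncorrected per-hop latency, 50 ns as a lower bound on the added FEC latency); the hop count of 5 and the function name are assumptions for illustration only.

```python
def path_latency_ns(hops, base_per_hop_ns, fec_per_hop_ns=0.0):
    """Total latency over a path when every hop adds the same delay."""
    return hops * (base_per_hop_ns + fec_per_hop_ns)


hops = 5  # assumed path length
without_fec = path_latency_ns(hops, base_per_hop_ns=100.0)
with_fec = path_latency_ns(hops, base_per_hop_ns=100.0, fec_per_hop_ns=50.0)
print(without_fec, with_fec, with_fec / without_fec)
```

Even with the most favorable reading of both bounds, the path latency grows by at least half (500 ns to 750 ns in this example); because the uncorrected per-hop latency is stated to be well under 100 ns, the relative increase in practice would be larger.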