The following description of the background of the invention is provided simply as an aid in understanding the invention and is not admitted to describe or constitute prior art to the invention.
High-speed serial links (“HSSL”) are an increasingly popular method for interconnecting semiconductor components. HSSL may be used in the following technologies: fabric interconnects, memory interfaces (FBD), I/O interfaces (PCI-Express) and CPU connections (CIS). HSSL involve complex circuitry and can often run long distances within a system.
Link errors may occur on high-speed serial links and can cause numerous problems, including transmission failure. There are many types of link errors including, but not limited to, a broken connection, an intermittent connection, a degraded connection, an incorrectly seated connector, system noise, soft errors in the physical block and hard errors in the physical block. Link errors can produce a number of symptoms, including a continuous stream of bit errors, seemingly random errors, multiple intermittent or degraded bits, the failure of single or multiple bits, non-repeating single-channel errors and permanent or intermittent failures of a channel or link. Causes of link errors can include the gross failure of a connector, vibration, a cracked solder ball or trace, a corroded or contaminated connector, poor installation, disk or DRAM, radiation and ESD or latent defects. Accordingly, there are a large number of possible failure modes in high-speed serial links, with a correspondingly large number of observable symptoms. Thus, it is important to be able to isolate and correct link errors in an efficient manner.
Repairing links in real time while minimizing the risk of undetected transmission errors requires hardware to monitor error activity and invoke resilience mechanisms if too many errors occur. If a link is operating properly, the error rate should be zero. However, it is difficult to determine how to treat intermittent failures, since they occur only periodically, and such failures are difficult to debug. Furthermore, certain types of multi-bit errors are not detected by conventional cyclic redundancy checks (CRC). Allowing too many intermittent errors to occur without corrective action to repair the link can result in undetected serious errors and silent data corruption (SDC).
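The limitation of CRC noted above can be illustrated with a minimal sketch. It assumes a plain CRC-8 with generator polynomial x^8 + x^2 + x + 1, a zero initial value and no final XOR; these parameters, the `crc8` helper and the sample payload are chosen only for illustration and are not taken from any particular link protocol. Any error pattern that is a multiple of the generator polynomial leaves the CRC remainder unchanged, so the multi-bit corruption below passes the check while a single-bit error is caught.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, zero initial value, no final XOR (illustrative)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


# Hypothetical payload; the transmitted codeword is payload plus its CRC byte.
message = b"link"
codeword = message + bytes([crc8(message)])

# A four-bit error pattern equal to the full generator polynomial (0x107),
# shifted to an arbitrary position inside the codeword.
GENERATOR = 0x107
error_pattern = GENERATOR << 13

corrupted_int = int.from_bytes(codeword, "big") ^ error_pattern
corrupted = corrupted_int.to_bytes(len(codeword), "big")

# For comparison, a single flipped bit at the same position.
single_bit_int = int.from_bytes(codeword, "big") ^ (1 << 13)
single_bit = single_bit_int.to_bytes(len(codeword), "big")

clean_check = crc8(codeword)         # remainder of an intact codeword
multi_bit_check = crc8(corrupted)    # remainder after the four-bit error
single_bit_check = crc8(single_bit)  # remainder after the one-bit error

print(clean_check, multi_bit_check, single_bit_check)
```

In this sketch a zero remainder means the receiver accepts the frame, so the four-bit corruption would be delivered as valid data, which is the kind of silent data corruption described above.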
In order to minimize the probability of SDC on links, link controllers must implement hardware to perform error handling. However, with intermittent errors, it is difficult to identify the individual lanes in a serial link that are causing the errors and need repair. Further, a link controller's capacity to handle errors is limited by the capacity of its error analysis engine and the rate at which the link controller might receive error information.
Accordingly, conventional systems address intermittent errors only after their occurrence reaches a certain threshold. Unfortunately, several undetected errors can occur before repair begins, and waiting to repair a link until this stage has several negative consequences. Thus, there is a need for a system and method for efficiently detecting and repairing intermittent errors in high-speed serial links during operation.
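The conventional threshold-based approach described above might be sketched as follows. This is a simplified software model, not a description of any actual link controller: the `ThresholdErrorMonitor` class, the per-lane counters and the threshold value of 4 are all assumptions made for illustration.

```python
from collections import Counter


class ThresholdErrorMonitor:
    """Hypothetical model of a monitor that repairs a lane only at a threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.errors = Counter()  # lane index -> number of detected errors
        self.repaired = set()    # lanes for which repair has been triggered

    def record_error(self, lane: int) -> bool:
        """Count one detected error; return True when repair is triggered."""
        self.errors[lane] += 1
        if self.errors[lane] >= self.threshold and lane not in self.repaired:
            self.repaired.add(lane)
            return True
        return False


# Five detected errors on lane 2 with an assumed threshold of 4.
monitor = ThresholdErrorMonitor(threshold=4)
results = [monitor.record_error(lane=2) for _ in range(5)]
print(results)
```

The first three detected errors trigger no corrective action in this model, and errors that escape detection never reach the counter at all, which is why repair that waits for a threshold can begin late.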