In the past, it was very common for computer systems to use wide parallel buses with many bits, or bitlanes, in a parallel configuration. These buses would deliver a dataword from a source to a receiver in one transfer. Thus, for example, a commonly used bus would deliver 64 databits to its destination every transfer cycle. Such a bus could be found on-chip, on-module, and on-board. Also in the past, it was very common for communications systems to use a narrow, single-wire bus with only one bitlane per bus. These buses would deliver their dataword from a single source to one or more receivers over many transfer cycles, i.e., one bit after another would be sent down the bitlane until the entire payload or dataword was delivered.
In order to ensure that the data arrives safely at the receiver, error checking and/or error correcting on the bus is employed. In high-reliability computers, the parallel buses are typically protected with an ECC scheme. In high-reliability communications links, cyclic redundancy checking (CRC) is often employed. Generally speaking, ECC is usually used to provide “real-time” correction of one or more bad databits, and CRC is usually used to provide “real-time” detection of one or more bad databits. In an ECC scheme, the ECC logic adjusts the data received by the receiver such that “good” data is passed along downstream. In the CRC scheme, the data source is required to resend the bad dataword when the CRC signals that bad data was received. In such systems, ECC tends to be more effective when the nature of the errors is permanent (e.g., hard errors), and CRC tends to be more effective when the nature of the errors is transient (e.g., soft errors).
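The contrast between detect-and-resend and correct-in-place can be illustrated with a minimal Python sketch. It is not taken from any particular bus design: the CRC-8 polynomial (x^8 + x^2 + x + 1), the Hamming(7,4) code, and the names `crc8`, `crc_receive_with_retry`, `hamming74_encode` and `hamming74_correct` are all assumptions chosen for illustration, and `channel` is a hypothetical callable standing in for the link.

```python
from typing import Callable, List, Tuple


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 with initial value 0 (illustrative polynomial choice)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def crc_receive_with_retry(payload: bytes,
                           channel: Callable[[bytes], bytes],
                           max_tries: int = 3) -> Tuple[bytes, int]:
    """CRC style: detect a bad frame and ask the source to resend it."""
    frame = payload + bytes([crc8(payload)])
    for attempt in range(1, max_tries + 1):
        received = channel(frame)
        data, check = received[:-1], received[-1]
        if crc8(data) == check:
            return data, attempt
    # A permanent (hard) fault corrupts every resend, so retry never succeeds.
    raise RuntimeError("CRC error on every attempt")


def hamming74_encode(nibble: int) -> List[int]:
    """ECC style: encode 4 data bits into a 7-bit Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers codeword positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_correct(code: List[int]) -> int:
    """Fix at most one flipped bit in place and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)  # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
```

In this sketch, a channel that flips a bit once is recovered by the resend, whereas a channel with a stuck bit fails on every attempt. The Hamming decoder, by contrast, returns the correct nibble without any resend even when the same bit position is always wrong, at the cost of correcting only one bit per codeword.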
In future electronic systems, the traditional boundaries between computers and communication systems are blurring. Data is often transferred along a parallel, high-speed bus over several transfer cycles. This scheme provides very high bandwidth, but it also makes it necessary to deal with both hard and soft errors. Hard errors occur when the physical medium experiences a fault, such as a burned-out driver. Soft errors occur when a bit along a single bitlane is flipped due to conditions such as noise, skew and/or jitter.
The industry is moving in the direction of applying CRC across the multiple bitlanes of a high-speed, parallel bus, with the receiver signaling for a retry whenever an error is detected. These schemes have strong error detection, which is effective for soft errors, but they cannot correct an error, which makes them less useful for hard errors. In systems where hard error protection is necessary, an extension to the CRC has been proposed which includes a spare bitlane in the bus such that when a hard error is encountered, the bus will re-configure itself to replace the failing bitlane with the presumably-good spare bitlane.
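The spare-bitlane re-configuration can be sketched as a logical-to-physical lane map. The following Python model is only an illustration under stated assumptions: the class `SparedBus`, its single spare lane, and the `stuck_lanes` argument used to simulate a hard fault are hypothetical and do not describe any specific proposed implementation.

```python
from typing import Dict, List


class SparedBus:
    """Toy model of a parallel bus with one spare bitlane (hypothetical)."""

    def __init__(self, data_lanes: int) -> None:
        self.data_lanes = data_lanes
        self.spare = data_lanes                  # physical index of the spare
        self.lane_map = list(range(data_lanes))  # logical lane -> physical lane
        self.spare_used = False

    def replace_lane(self, failed_physical: int) -> None:
        """Steer the logical lane on a failed wire onto the spare wire."""
        if self.spare_used:
            raise RuntimeError("spare bitlane already deployed")
        logical = self.lane_map.index(failed_physical)
        self.lane_map[logical] = self.spare
        self.spare_used = True

    def transmit(self, bits: List[int], stuck_lanes: Dict[int, int]) -> List[int]:
        """Send one transfer; stuck_lanes maps a physical lane to a stuck value."""
        wires = [0] * (self.data_lanes + 1)
        for logical, bit in enumerate(bits):
            wires[self.lane_map[logical]] = bit
        for lane, value in stuck_lanes.items():
            wires[lane] = value                  # hard fault overrides the driver
        return [wires[self.lane_map[i]] for i in range(self.data_lanes)]
```

With four data lanes and physical lane 2 stuck at 0, `transmit([1, 0, 1, 1], {2: 0})` returns corrupted data until `replace_lane(2)` is called, after which the same transfer arrives intact over the spare wire. The model says nothing about how long the re-configuration takes, which is the window the limitations discussed later are concerned with.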
Another alternative for providing protection for both hard and soft errors is a symbol-protecting bus ECC structure, where the symbols are defined along the bitlanes, rather than the traditional, across-word structure. This may be utilized when the ECC word is received in multiple packets. The ECC word includes data bits and ECC bits arranged into multi-bit ECC symbols, where each of the ECC symbols is associated with one of the bitlanes on the memory bus. The ECC symbols are then used to perform error detection and correction for the bits in the ECC word received via the bitlane and associated with the ECC symbol. This has been described in United States Patent Publication No. US20060107175A1, of common assignment herewith, filed Oct. 29, 2004, entitled: “System, Method and Storage Medium for Providing Fault Detection and Correction in a Memory Subsystem.”
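The difference between symbols defined along the bitlanes and symbols defined across the word can be illustrated by counting how many symbols a single failed bitlane touches. The Python sketch below shows only the grouping of bits into symbols, not the ECC code of the cited publication; the function names, the 8-lane by 4-transfer geometry, and the 4-bit symbol width in the usage note are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Symbol = Tuple[int, ...]


def symbols_along_lanes(transfers: Sequence[Sequence[int]]) -> List[Symbol]:
    """One symbol per bitlane: the bits that lane carried over all transfers."""
    lanes = len(transfers[0])
    return [tuple(t[lane] for t in transfers) for lane in range(lanes)]


def symbols_across_word(transfers: Sequence[Sequence[int]],
                        width: int) -> List[Symbol]:
    """Traditional layout: each symbol is `width` adjacent bits of one transfer."""
    lanes = len(transfers[0])
    return [tuple(t[start:start + width])
            for t in transfers
            for start in range(0, lanes, width)]


def corrupted(sent: List[Symbol], received: List[Symbol]) -> List[int]:
    """Indices of the symbols that differ between sent and received data."""
    return [i for i, (a, b) in enumerate(zip(sent, received)) if a != b]
```

For example, take four transfers of all ones on an 8-lane bus and force lane 3 to 0 in every transfer. Comparing `symbols_along_lanes` of the sent and received data reports one bad symbol (index 3), while comparing `symbols_across_word` with a width of 4 reports four bad symbols (indices 0, 2, 4 and 6). A code that corrects a single symbol could therefore cover the failed lane only in the along-the-lane layout.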
In data transmissions along communication channels or buses, noise in the channel cannot practically be eliminated. As a result, soft errors occur in the data being transmitted across the buses. In addition, systematic permanent faults, or hard errors, such as a break in a wire or a malfunction of a driver or receiver, also result in errors in the data being transmitted. A limitation of the extension to the CRC approach described above is that the system completely stops functioning while it waits for a spare wire to be deployed. A limitation of the symbol-protecting bus ECC structure described above, when implemented with a bus that includes a dedicated spare wire, is that it is vulnerable to an all-too-common soft error while the system waits for the spare wire to be deployed. It would be desirable to be able to overcome both of these limitations by providing a system that continues to function and provide error correction and detection while waiting for a spare wire to be deployed. In addition, it would be desirable to have a fault-tolerant high-speed parallel bus that is resilient to both hard and soft errors.