Error Correcting Codes assist in the reliable transmission and storage of data by providing a mechanism to recover data that has been distorted by noise or another disturbance. Many hardware diagnostic tests for memory arrays rely on hardware-generated Error Correcting Codes to detect and correct single-bit errors. Error Correcting Codes are often further enabled to detect, but not correct, multi-bit errors, known as uncorrectable errors. The redundancy provided by Error Correcting Codes is crucial for many applications where re-transmission of messages is impossible or costly.
The majority of the communication buses between subcomponents have Error Correcting Code protection built into the hardware in order to improve reliability. A bus is a wire or set of wires connecting two or more devices. The Error Correcting Codes of the data are checked by the receiving piece of hardware on the bus. If an error is detected, that error is recorded in the error register built into the hardware. Most of the time, this error information is reported to the service processor of the system using an interrupt. A service processor typically comprises a peripheral card located in the server for performing various firmware functions. Software executing on the service processor may perform diagnostic and/or repair actions for an error.
As discussed herein, detection of a correctable error implies that the original data is recoverable based on the Error Correcting Code algorithm. Typically, when an error can be corrected, the Error Correcting Code algorithms allow for the detection of which part of the data was corrupted. This error location corresponds to a specific bit or group of bits, which may reflect a specific faulty wire or pin.
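To illustrate how a correctable error can be traced to a specific bit, the following is a minimal sketch using a textbook Hamming(7,4) code. This code is chosen here only for illustration and is not necessarily the Error Correcting Code used by any particular bus. The function names are hypothetical, and the sketch assumes that each codeword bit position travels on its own wire.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Codeword layout by 1-indexed position: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_syndrome(codeword):
    """Return the 1-indexed position of a single flipped bit, or 0 if clean."""
    c = codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    return s1 | (s2 << 1) | (s3 << 2)


sent = hamming74_encode([1, 0, 1, 1])
received = list(sent)
received[4] ^= 1                            # corrupt the bit at position 5
faulty_position = hamming74_syndrome(received)  # 5
received[faulty_position - 1] ^= 1          # flip it back to correct the error
```

Under the one-bit-per-wire assumption, the syndrome value doubles as the identifier of the suspect wire or pin, which is the information diagnostic firmware would need for a targeted repair.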
Many data and command buses have spare wires. When diagnostic firmware detects correctable errors frequently occurring on a specific wire, the diagnostic firmware may perform a self-heal operation on the hardware by reprogramming the hardware to use the spare wire instead of the faulty wire. Depending on the hardware support, this repair action can be done dynamically, without rebooting the server, or statically, during the machine's initial program load, or IPL.
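The self-heal operation described above can be sketched as a small bookkeeping model. Everything in this sketch is a hypothetical illustration: the class name, the per-wire error counter, and the threshold of three errors are assumptions, and real firmware would reprogram hardware registers rather than update a dictionary.

```python
from collections import Counter


class SpareLaneBus:
    """Toy model of a bus whose logical lanes can be remapped to spare wires."""

    def __init__(self, lanes, spares, threshold=3):
        # Logical lane -> physical wire; initially the identity mapping.
        self.lane_map = {lane: lane for lane in range(lanes)}
        # Spare wires are numbered after the regular wires (assumption).
        self.free_spares = list(range(lanes, lanes + spares))
        self.threshold = threshold
        self.ce_counts = Counter()

    def report_correctable_error(self, lane):
        """Record a correctable error on a lane and repair it if warranted."""
        self.ce_counts[lane] += 1
        if self.ce_counts[lane] < self.threshold:
            return "logged"
        if not self.free_spares:
            return "no_spare"
        # Steer the logical lane onto the next free spare wire.
        self.lane_map[lane] = self.free_spares.pop(0)
        self.ce_counts[lane] = 0
        return "repaired"


bus = SpareLaneBus(lanes=8, spares=1)
results = [bus.report_correctable_error(2) for _ in range(3)]
```

In this model the third error on lane 2 triggers the repair, after which lane 2 is carried by wire 8. Once the spares are exhausted, further repair requests report that no spare is available.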
An uncorrectable error means that the Error Correcting Code encoding has been damaged in transmission such that the original data cannot be recovered. In the event of an uncorrectable error, the Error Correcting Code algorithm does not allow for the identification of the specific failing location because data has been lost. Therefore, the only action diagnostic firmware can perform is keeping the bus from being used until a service action can be completed.
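The difference between a correctable and an uncorrectable error can be illustrated with a textbook extended Hamming (8,4) code, which corrects single-bit errors and detects double-bit errors. This is a sketch for illustration only. The function names and return values are hypothetical, and actual bus hardware may use a different code.

```python
def secded_encode(data_bits):
    """Encode 4 data bits as a 7-bit Hamming codeword plus overall parity."""
    d1, d2, d3, d4 = data_bits
    word = [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]  # position 8 holds the overall parity bit


def secded_check(word):
    """Classify a received 8-bit word.

    Returns ("ok", None), ("correctable", position) with a 1-indexed
    position, or ("uncorrectable", None).
    """
    syndrome = 0
    for position in range(1, 8):
        if word[position - 1]:
            syndrome ^= position  # XOR of set-bit positions is 0 when valid
    overall = 0
    for bit in word:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        return ("ok", None)
    if overall == 1:
        # Odd parity: treated as one flipped bit, which the syndrome locates.
        return ("correctable", syndrome if syndrome else 8)
    # Even parity with a nonzero syndrome: two bits flipped, location unknown.
    return ("uncorrectable", None)


word = secded_encode([1, 0, 1, 1])
word[1] ^= 1
word[6] ^= 1
status = secded_check(word)  # ("uncorrectable", None)
```

With two flipped bits the syndrome is nonzero but no longer identifies either failing position, which is why no location is reported in the uncorrectable case.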
There are many stages where a hardware failure can occur. From the point when the data is encoded with an Error Correcting Code, it passes through, at a minimum, transmitter logic, bus connectors or pins, bus wires and receiver logic. All of this occurs before the Error Correcting Code is checked and decoded. The failure modes of these stages can be categorized into two classes, or modes: single bit and full bus failure.
Single bit failure is the most common failure mode. This failure may occur when there is a loose or corroded connection, a degrading wire, or a malfunction in the transmitter or receiver logic. In the event of a single bit failure, a repair action can be performed.
Full bus failures may occur if there is a high level of interference due to other nearby circuitry, or due to a weak transmitter or receiver. In the event of a full bus failure, it is unsafe to perform a repair action because there is a greatly increased likelihood of an uncorrectable error occurring, even if the bus is currently only experiencing correctable errors. In this instance, the bus must be disabled and routed around, if possible.
Current solutions do not attempt to distinguish between these failure modes, but instead perform the single bit type repair action any time there is a problem. After the repair action has been performed, another error must occur before the bus is completely removed from operation. This leaves the machine at greater risk of uncorrectable errors until the bus has been removed.
Therefore, to facilitate improved system reliability, there exists a need to identify and distinguish between single bit and full bus failure modes so that the appropriate corrective action can be taken for each.
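As a purely illustrative sketch of what distinguishing the two failure modes could look like, the following heuristic examines where recent correctable errors were located. It is an assumption made for illustration and is not a method stated in this section. The function name, the minimum sample size, and the 75 percent concentration threshold are all hypothetical.

```python
from collections import Counter


def classify_failure_mode(error_lanes, min_errors=4, dominant_share=0.75):
    """Guess the failure mode from the lanes of recent correctable errors.

    error_lanes: list of lane indices, one entry per correctable error.
    """
    if len(error_lanes) < min_errors:
        return "undetermined"  # too few samples to judge
    counts = Counter(error_lanes)
    _, top_count = counts.most_common(1)[0]
    if top_count / len(error_lanes) >= dominant_share:
        # Errors concentrate on one lane: consistent with a single bit failure.
        return "single_bit"
    # Errors are spread across lanes: consistent with a full bus failure.
    return "full_bus"


concentrated = classify_failure_mode([3, 3, 3, 3, 3])
scattered = classify_failure_mode([0, 5, 2, 7, 1, 4])
```

Under these assumed thresholds, errors concentrated on one lane would suggest the spare-wire repair, while errors scattered across lanes would suggest disabling the bus.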