As the components of symmetric multiprocessing (SMP) computer systems become denser, there are more ways in which these systems can experience faults or errors, such as soft errors in arrays or broken wires in data buses. Error Correction Codes (ECC) are often placed into designs to protect against these unexpected failures. ECC can also be useful in detecting errors caused by design deficiencies or process variations, such as noise or weak array cells. The ECC logic can therefore serve a dual purpose: it corrects these errors, and it helps debug the underlying issues by trapping information pertinent to the fail in dedicated “trap registers” in the hardware. Because trapping logic occupies space on the chip, tradeoffs need to be made between debug and mainline functionality.
Currently, when an error is detected in the data and ECC, trap registers capture the failing data pattern and ECC pattern, as well as the syndrome that was used to correct that data and ECC pattern. ECC relies on a multiplicity of parity groups over the same data. By grouping the data bits in different parity group combinations, it is possible to isolate a failure to the bit or bits that failed. A representation of these groupings is typically called an h-matrix. Each parity group is checked for errors, and the vector of the check results is known as the syndrome. The syndrome can indicate one of three conditions: no error, a unique correctable error, or an uncorrectable error.
The trap registers can be set to capture data only when a correctable error (CE) is detected, only when an uncorrectable error (UE) is detected, or when any error is detected (default). Error correction is a logic design scheme that is capable of detecting and correcting a certain class of error; an error in this class is referred to as a correctable error (CE). Error correction can also detect another class of error that it cannot correct; an error in this class is referred to as an uncorrectable error (UE).
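The parity-group, syndrome, and CE/UE ideas above can be illustrated with a minimal sketch. It assumes a small extended Hamming (8,4) code, chosen only because it is easy to follow; it is not the code used by the hardware described here, and the names `H`, `classify`, `flip`, and `GOOD` are hypothetical.

```python
# Illustrative h-matrix for a Hamming(7,4) code (an assumption for this sketch).
# Each row is one parity group; column j (1-based) is the binary form of j,
# so a single-bit error yields a syndrome that spells out the failing position.
H = [
    [1, 0, 1, 0, 1, 0, 1],  # parity group over positions 1, 3, 5, 7
    [0, 1, 1, 0, 0, 1, 1],  # parity group over positions 2, 3, 6, 7
    [0, 0, 0, 1, 1, 1, 1],  # parity group over positions 4, 5, 6, 7
]

def classify(word):
    """Check an 8-bit word: 7 Hamming bits followed by one overall parity bit.

    Returns (status, syndrome, corrected_word), where status is
    "NONE", "CE" (correctable) or "UE" (uncorrectable).
    """
    # Evaluate every parity group; the vector of results is the syndrome.
    s = [sum(h & b for h, b in zip(row, word[:7])) % 2 for row in H]
    overall = sum(word) % 2          # 0 when overall (even) parity holds
    syndrome = s + [overall]
    pos = s[0] + 2 * s[1] + 4 * s[2]  # failing position, 0 if none indicated

    if pos == 0 and overall == 0:
        return "NONE", syndrome, list(word)
    if overall == 1:
        # Odd number of flipped bits: treat as a single, correctable error.
        fixed = list(word)
        fixed[pos - 1 if pos else 7] ^= 1  # pos == 0 means the overall parity bit
        return "CE", syndrome, fixed
    # Parity groups disagree but overall parity holds: double-bit error.
    return "UE", syndrome, None

def flip(word, *indexes):
    """Return a copy of word with the given 0-based bit indexes inverted."""
    out = list(word)
    for i in indexes:
        out[i] ^= 1
    return out

# A valid codeword for data bits 1, 0, 1, 1 under this sketch's layout.
GOOD = [0, 1, 1, 0, 0, 1, 1, 0]

if __name__ == "__main__":
    for label, word in [("clean", GOOD),
                        ("one bit flipped", flip(GOOD, 4)),
                        ("two bits flipped", flip(GOOD, 1, 4))]:
        print(label, classify(word))
```

In this sketch a single flipped bit is repaired from the syndrome alone, while two flipped bits are detected but reported as uncorrectable, which mirrors the CE and UE classes described above.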
The trap registers can also be set to capture the first occurrence of a UE, a CE, or either type of error, or they can be set to always capture the latest error (default). Currently the hardware also maintains a counter of the total number of times a CE, a UE, or either type of error (default) was detected by the ECC logic.
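The trap and counter settings described above can be modeled in software as a rough sketch. The class and field names (`TrapRegisters`, `trap_on`, `hold_first`, `count_on`) are hypothetical, and the model assumes a single filter shared by the trap and first-occurrence settings; the actual register layout is not given in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TrapRegisters:
    """Software model of ECC trap registers (illustrative assumption only)."""
    trap_on: str = "ANY"       # "CE", "UE", or "ANY" (default)
    hold_first: bool = False   # False = always keep the latest error (default)
    count_on: str = "ANY"      # which error type the counter counts
    trapped: Optional[Tuple[str, int, int, int]] = None
    count: int = 0

    def report(self, kind: str, data: int, ecc: int, syndrome: int) -> None:
        """Record one detected error of kind "CE" or "UE"."""
        # The counter only tracks a running total for the selected type.
        if self.count_on in ("ANY", kind):
            self.count += 1
        # Skip errors that the trap is not configured to capture.
        if self.trap_on not in ("ANY", kind):
            return
        # In first-occurrence mode, keep the earliest capture.
        if self.hold_first and self.trapped is not None:
            return
        self.trapped = (kind, data, ecc, syndrome)

if __name__ == "__main__":
    # Made-up error events: (kind, data pattern, ECC pattern, syndrome).
    events = [("CE", 0x11, 0x1, 0x5), ("UE", 0x22, 0x2, 0x6),
              ("UE", 0x33, 0x3, 0x7), ("CE", 0x44, 0x4, 0x9)]
    default = TrapRegisters()
    first_ue = TrapRegisters(trap_on="UE", hold_first=True, count_on="CE")
    for event in events:
        default.report(*event)
        first_ue.report(*event)
    print(default.trapped, default.count)
    print(first_ue.trapped, first_ue.count)
```

Note that in either configuration only one error is held at a time and the counter yields only a single total.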
This functionality can be very helpful in debugging the hardware problems and defects that caused the errors in the first place, but it has limitations. The trapping logic captures only the first or the last error, and the counter records only the total number of errors that have occurred on the checked data bus. There are instances where you may need more information about the failure than can be easily gleaned from the available data. For example, you may need to trap information on all of the correctable errors that have occurred on the protected data, to trap the data pattern associated with a specific fail, or to stop the system on a specific fail for further debug. The counting register is similarly limited: there is no way to control which error is counted, or to exclude a specific error from the count. All of this information could be useful in debugging the mechanism that is causing the fail.