1. Field of the Invention
Embodiments of the present invention generally relate to the handling of data errors in computer systems.
2. Description of the Related Art
Data errors in computers can occur when a binary unit of information (i.e., a bit) becomes unintentionally altered, causing a 1 to be read as 0, or vice versa. The cause of the data error is typically some physical event that is not part of the intended function of the computer. Some examples of such events are: a cosmic ray striking a Random Access Memory (RAM) cell; a portion of a disk drive platter spontaneously flipping magnetization; or noise from background radiation degrading the signal in a network cable.
In the prior art, techniques have been devised to detect and correct data errors within specific computer components. For example, the use of error correction code (ECC) in RAM allows some errors to be corrected within the RAM itself. However, some types of errors cannot be corrected within the component in which they occur. In some cases, data is determined to be erroneous but cannot be corrected, and is then transferred to another component. It is possible that the error may not be detected by the downstream component. If the erroneous data is stored without notification to the user, it can appear to be normal data, and can cause further errors as it is later used by the system. This problem, known as silent data corruption, can lead to computer downtime and loss of critical data.
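By way of illustration only, the distinction between an error that is corrected within a component and an error that is detected but cannot be corrected may be sketched as follows. The sketch assumes a textbook Hamming(7,4) code extended with an overall parity bit (single-error-correcting, double-error-detecting); it is not a description of any particular ECC memory device, and the names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of 0/1 values).

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming(7,4) positions, with check bits at 1, 2 and 4.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # Check bit p gives even parity over every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code


def decode(code):
    """Return (status, nibble), where status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of the set positions is 0 for a valid codeword
    overall_bad = sum(code) % 2 == 1

    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Exactly one bit flipped; the syndrome names it (0 means the overall parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        # The error is detected, but the component cannot repair it.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In this sketch, a single flipped bit yields the status 'corrected', whereas two flipped bits yield 'uncorrectable'. The latter corresponds to the case described above, in which data is known to be erroneous but cannot be repaired within the component in which the error occurred.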
Some protocols in the art (e.g., HyperTransport, PCI Express) include data indicators to allow the erroneous data to be marked as “poisoned” in order to alert any downstream computer components that receive the data. However, even if the erroneous data is marked as poisoned, it is possible that the downstream components that receive the data are not configured to recognize the poisoned data indicator, or are not capable of correcting the error. If so, the result can be silent data corruption. Thus, there is a need in the art for a method of handling poisoned data so that data errors can be corrected in the most effective manner, and not lead to silent data corruption.
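By way of illustration only, the failure mode described above may be sketched as follows. The sketch assumes a simplified packet carrying a single boolean poison indicator; it does not model the actual packet formats of HyperTransport or PCI Express, and the names `Packet`, `PoisonAwareStore` and `LegacyStore` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    poisoned: bool = False  # hypothetical stand-in for a protocol's poison indicator


@dataclass
class PoisonAwareStore:
    """Downstream component that checks the indicator before storing data."""
    data: list = field(default_factory=list)
    rejected: int = 0

    def receive(self, pkt):
        if pkt.poisoned:
            self.rejected += 1  # the error is surfaced rather than stored
            return False
        self.data.append(pkt.payload)
        return True


@dataclass
class LegacyStore:
    """Downstream component that is not configured to recognize the indicator."""
    data: list = field(default_factory=list)

    def receive(self, pkt):
        # The poison indicator is never examined, so erroneous data is stored
        # as if it were normal data.
        self.data.append(pkt.payload)
        return True


bad = Packet(b"\x00\xff", poisoned=True)
aware, legacy = PoisonAwareStore(), LegacyStore()
aware.receive(bad)   # rejected; nothing is stored
legacy.receive(bad)  # accepted; the erroneous payload is now stored
```

After both calls, the store that ignores the indicator holds the erroneous payload with no record that it was marked as poisoned, which is the silent data corruption described above.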