1. Field of the Invention
The present invention relates generally to a processor system methodology. It particularly relates to a method and apparatus for providing an early indication of a processor soft error being propagated through a computing system.
2. Background
Modern semiconductor process technology is creating processors with smaller sizes to reduce hardware space and increase processor efficiency. However, the smaller sizes make the modern processor more susceptible to single event upsets, which are transient errors (temporary or soft errors) caused by exposure to cosmic rays and/or alpha particles. These particles, arriving via atmospheric radiation or emitted by trace levels of radioactive materials in packaging, may penetrate the processor and cause state devices (e.g., flip-flops) to make unplanned transitions from one state to another (e.g., a bit value changes from 1 to 0). Also, for processors designed with domino logic (a type of circuit design in which cascaded logic stages are pre-biased), these transient errors may propagate throughout the entire system logic, causing further instability and ultimately a hard failure (e.g., the device is taken out of service).
Additionally, “silent data corruption” may develop in processor computing systems where errors occur but are not detected by error checking logic. A hypothetical example is a misplaced decimal point in an accounting operation. Although a definite error has occurred (e.g., a payment of $10,000.00 instead of $100.00), the accounting operations continue to completion and the system believes all operations were completed successfully. This type of “silent error” encourages parallel processing designs that check that all computing elements calculate the same result (answer).
Several methods may be used for error detection/correction. One common method is the use of error detecting bits (e.g., parity bits) to help detect errors when they occur. With this technique, a parity bit is commonly applied to an 8-bit data field, and a bit error may be detected when one of the resulting nine bits is in error. For this simple use of parity bits, the error is ambiguous: all that is known is that there is an error, and there is no information about what kind of error it is or what recovery mechanism can be implemented. Another technique uses error correcting code (ECC) memory to actually correct errors. This technique uses multiple parity bits, each having a different definition, to help uniquely specify and correct the error. Each parity bit indicates an error in a subset of the data field, which helps narrow down exactly which bit is in error. An additional technique uses parity syndrome bits; because this method identifies the bits in error, the errors are unambiguous and may be both detected and corrected.
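The difference between ambiguous detection and unambiguous correction can be illustrated with a minimal sketch. It assumes even parity over an 8-bit field and, for the multiple-parity-bit case, a textbook Hamming(7,4) code over a 4-bit field. The function names are hypothetical and chosen only for illustration; they do not describe any particular processor or memory implementation.

```python
def parity_bit(data):
    """Even-parity bit for an 8-bit data field (assumed convention)."""
    return bin(data & 0xFF).count("1") % 2

def parity_ok(data, parity):
    """True if the nine bits (8 data + 1 parity) are consistent."""
    return parity_bit(data) == parity

def hamming74_encode(nibble):
    """Encode 4 data bits into 7 bits; list index i holds position i + 1."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_syndrome(code):
    """Return 0 if consistent, else the 1-based position of a single bad bit."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s4 = code[3] ^ code[4] ^ code[5] ^ code[6]
    return s1 | (s2 << 1) | (s4 << 2)

def hamming74_correct(code):
    """Correct a single-bit error; returns (corrected code, syndrome)."""
    syndrome = hamming74_syndrome(code)
    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1   # the syndrome names the bit in error
    return fixed, syndrome

if __name__ == "__main__":
    data = 0b10110010
    p = parity_bit(data)
    upset = data ^ (1 << 5)                 # single soft error
    print(parity_ok(upset, p))              # error seen, location unknown
    code = hamming74_encode(0b1011)
    code[4] ^= 1                            # single soft error at position 5
    print(hamming74_correct(code))          # error located and repaired
```

In this sketch a single parity bit reports only that some bit among the nine is wrong, while each Hamming parity bit checks a different subset of positions, so the combined syndrome identifies the bit in error. A single parity bit also fails to notice an even number of flipped bits.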
Modern processor systems commonly employ a multiple processor structure where parallel processing is performed using a plurality of processors (usually linked in lockstep) to execute instructions and compute answers simultaneously. These processing systems typically use ECC logic and parity syndrome logic to detect and correct constant errors occurring along critical data paths (paths tied to memory arrays). However, soft (transient) errors may occur along the non-critical data paths (paths along which the instruction stream is processed and executed) within the processor that use random logic.
For these parallel processing systems, which are commonly connected in a functional redundancy check, both processors execute the instruction stream along these non-critical data paths on a clock-by-clock basis and compare the resulting architectural state updates. If the architectural states (computed answers) differ, an ambiguous error has occurred (similar to the simple use of parity bits). There is enough information to determine that there is a problem, but unless there is sufficiently redundant information, logic or software cannot determine which result is the correct one. When only the architectural state is being compared, a soft error will corrupt the program flow currently being executed. If this is a restartable transaction in a database system, the operating system software may simply restart the program flow. If, however, the operating system (OS) is performing critical system table updates, the error may cause an OS panic and system crash. Somewhere between these two extreme responses would be a system application that suddenly terminates, leaving the user in an unknown state with work unfinished. To prevent these undesirable responses from occurring, there is a need to protect the non-critical data paths of the processor system with a mechanism that provides early detection of soft errors within stages of a multiple stage, pipelined processor system before they propagate to ambiguous error detection.
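The ambiguity of a two-way architectural state comparison can be shown with a minimal sketch. It models each architectural state as a plain integer and uses a third copy with majority voting as one assumed example of "sufficiently redundant information"; the function names are hypothetical, and the sketch does not represent the early detection mechanism itself.

```python
from collections import Counter

def flip_bit(value, bit):
    """Model a soft error as a single flipped bit in a state value."""
    return value ^ (1 << bit)

def lockstep_compare(state_a, state_b):
    """Functional redundancy check: True if both processors agree."""
    return state_a == state_b

def majority_vote(states):
    """Return the value held by a strict majority of copies, else None."""
    value, count = Counter(states).most_common(1)[0]
    return value if count > len(states) // 2 else None

if __name__ == "__main__":
    expected = 0x1234
    proc_a = expected
    proc_b = flip_bit(expected, 3)             # soft error in one processor

    print(lockstep_compare(proc_a, proc_b))    # mismatch is detected
    print(majority_vote([proc_a, proc_b]))     # two copies: no way to choose
    print(majority_vote([proc_a, proc_b, expected]))  # third copy resolves it
```

With only two copies the comparison reports that a problem exists, but neither value can be trusted over the other, which is the ambiguous error condition described above.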