1. Technical Field
The present invention relates to error detection in a data processing system. More specifically, the present invention is directed to a method and apparatus for checkpointing instruction groups with out-of-order floating point instructions in a multi-threaded processor.
2. Description of Related Art
Most data processing systems include mechanisms for detecting (and perhaps diagnosing) errors, as well as mechanisms for recovering from an error. These two functions are usually distinct, requiring different hardware (and software) mechanisms.
The RAS (Reliability, Availability, and Serviceability) concept, as implemented in hardware and software, is directed to preventing or recognizing system failures (Reliability), keeping the system functioning in the event of a failure (Availability), and fixing the failure in a non-disruptive way (Serviceability). RAS may be addressed at various levels of system development: during new product development, to diagnose design bugs; in manufacturing, to identify bad parts during the system build; and during operation, to catch errors while the system is running. RAS may also be directed to various types of failures, including system-level design oversights, logical errors, hard failures (hardware faults), and soft errors (data errors in memory or after data transfer due to external noise or circuit failure).
In some cases it is only necessary to recognize that an error has occurred. In others it is necessary to diagnose the error, that is, to specifically identify its source. Finally, in some cases it is desirable to remove or correct the error.
A commonly used method of error detection utilizes parity codes. In a one-bit parity code, the number of binary “ones” (1s) in a data word is counted. The word has odd parity if the number of 1s is odd and even parity otherwise. When a word is written into memory, its parity bit is written with it. When the word is read out, the parity is recomputed and compared with the stored parity bit. If the stored and recomputed parity bits do not match, an error is detected. A one-bit parity code reliably detects a single-bit error in the data item (more generally, any odd number of bit errors). If two bits are in error, the one-bit parity code will not detect any error, since the parity of the corrupted data matches the stored parity bit.
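The parity scheme described above can be illustrated with a minimal Python sketch. The function names (`parity_bit`, `check`) and the 8-bit example word are chosen purely for illustration and do not correspond to any particular hardware implementation.

```python
def parity_bit(word: int) -> int:
    """Return 1 if the word contains an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def check(word: int, stored_parity: int) -> bool:
    """Recompute parity on read-out and compare it with the stored bit."""
    return parity_bit(word) == stored_parity


# "Write" a word together with its parity bit.
word = 0b10110010
stored = parity_bit(word)

# Simulate a single-bit and a double-bit corruption of the stored word.
single_error = word ^ 0b00000100
double_error = word ^ 0b00010100

clean_ok = check(word, stored)            # no corruption: parity matches
single_ok = check(single_error, stored)   # one flipped bit: mismatch
double_ok = check(double_error, stored)   # two flipped bits: parity matches again
```

Here the single-bit corruption is caught because the recomputed parity differs from the stored bit, while the double-bit corruption passes the check unnoticed.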
While a one-bit parity code can detect an error, it cannot identify which particular bit in a data item is in error. Error correcting codes (ECCs), on the other hand, not only detect an error but also allow it to be corrected. Many large memory systems use a single error correcting, double error detecting (SEC-DED) code that allows detection of up to two bits of error and correction of a single bit of error. An error may be detected and corrected in a number of ways, including by system software or by circuitry that is integral to the particular component that generates the error.
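As one concrete instance of a SEC-DED code, the sketch below uses an extended Hamming (8,4) code: a 4-bit data value is protected by three Hamming parity bits plus one overall parity bit. This particular code and the names `encode` and `decode` are assumptions made for illustration; memory systems typically apply the same idea to wider words.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into 8 bits: index 0 is the overall parity bit,
    indices 1..7 are Hamming positions (parity at 1, 2, 4; data at 3, 5, 6, 7)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code: list[int]):
    """Return (status, nibble); nibble is None for an uncorrectable double error."""
    # XOR of the positions of all set bits is 0 for a valid Hamming codeword
    # and equals the error position after a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) % 2 == 1

    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single error and correct it
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "double_error", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


codeword = encode(0b1011)
one_flip = list(codeword)
one_flip[6] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1

result_clean = decode(codeword)
result_single = decode(one_flip)
result_double = decode(two_flips)
```

In this sketch a single flipped bit is located through the syndrome and corrected, whereas two flipped bits are reported but not corrected, which is the SEC-DED behavior described above.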
In traditional commercial designs, EDFI (Error Detection and Fault Isolation) was achieved via special checking logic that was integral to the data flow and control. Error detection refers to the percentage of errors that could reasonably be detected, perhaps by using parity schemes as discussed above. The error detection capability of such systems was high (e.g., 75%-90% of hardware faults detected). Fault isolation refers to the percentage of errors that could be traced to a particular component. The fault isolation of these conventional systems was also good (e.g., about 90% of the errors could be traced to a single component).
While valuable for preserving data integrity, EDFI schemes are not without their disadvantages. In conventional mainframe designs, all logical circuits were checked as the system was operating. In many cases, however, the checking logic (e.g., parity code predictors) was as complex as the circuit components (e.g., adders or shifters) themselves. This results in larger processors with circuits driving longer wires with more fan-out, and generally longer cycle times.
The above-described strategy uses the same fundamental concepts as existing zSeries processors, where a recovery unit (Runit) maintains an error correction code (ECC) hardened, checkpointed copy of all architected registers. In case of an error, this copy can be restored and processing resumed from the restored checkpoint. The basic Runit concept is that working copies of the registers are integrated with the execution pipelines and are updated and available for use as soon as an instruction finishes executing, while a separate copy of all registers is held in an ECC hardened checkpoint array. Updates to the checkpoint array are delayed relative to the working registers to allow for error detection and reporting; a reported error blocks all further checkpoint updates so that the checkpoint remains intact at a coherent instruction boundary. Register updates are accumulated in a write queue until they are ready for checkpointing and are then drained from the write queue into the checkpoint array.
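The relationship between working registers, the write queue, and the checkpoint array can be sketched as a simplified software model. The class `RecoveryUnit` and its method names are hypothetical and chosen for illustration; the model omits ECC protection and pipeline timing, and it assumes that queued updates are discarded when the checkpoint is restored.

```python
from collections import deque


class RecoveryUnit:
    """Toy model of a checkpointing recovery unit (no ECC, no timing)."""

    def __init__(self, num_regs: int):
        self.working = [0] * num_regs      # updated as soon as an instruction finishes
        self.checkpoint = [0] * num_regs   # delayed, known-good copy
        self.write_queue = deque()         # updates waiting to be checkpointed
        self.error_reported = False

    def finish_instruction(self, updates: dict):
        """Apply one instruction's register updates and queue them for checkpointing."""
        for reg, value in updates.items():
            self.working[reg] = value
        self.write_queue.append(dict(updates))

    def report_error(self):
        """A reported error blocks all further checkpoint updates."""
        self.error_reported = True

    def drain(self, count: int = 1) -> int:
        """Commit up to `count` oldest queued instructions, in order."""
        committed = 0
        while committed < count and self.write_queue and not self.error_reported:
            for reg, value in self.write_queue.popleft().items():
                self.checkpoint[reg] = value
            committed += 1
        return committed

    def recover(self):
        """Discard uncommitted updates and restore working state from the checkpoint."""
        self.write_queue.clear()
        self.working = list(self.checkpoint)
        self.error_reported = False


ru = RecoveryUnit(4)
ru.finish_instruction({0: 5})
ru.finish_instruction({1: 7})
ru.drain(1)                      # only the first instruction is checkpointed
ru.finish_instruction({0: 99})   # suppose this result is later found to be bad
ru.report_error()
blocked = ru.drain(2)            # no updates are committed after the error
ru.recover()
```

After recovery in this model, the working registers again hold the state as of the last checkpointed instruction, so execution could resume from that coherent boundary.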
Prior Runit implementations rely on fixed pipeline lengths to determine the checkpoint boundary. This approach does not support executing instructions out-of-order in execution pipelines of different depths, because instructions must always be checkpointed in order.
Thus, it would be advantageous to provide a mechanism and apparatus for checkpointing instruction groups with out-of-order floating point instructions in a multi-threaded processor.