1. Field of the Invention
The present invention relates to microprocessors in a data processing system. More specifically, the present invention is directed to a method and apparatus for reducing the number of cycles required to checkpoint instructions in a multi-threaded microprocessor.
2. Description of the Related Art
Most data processing systems use mechanisms for detecting (and perhaps diagnosing) errors, and also provide for the possibility of recovering from an error. These two functions are usually distinct, requiring different hardware (and software) mechanisms.
In modern computer systems, an error can occur at any time from a variety of sources, e.g., alpha particles striking a memory cell. Usually, the data of the architected registers are saved in checkpoint registers after the instructions have passed the completion point. “Checkpointing” means the saving of the architected register data in checkpoint registers. In order to be checkpointed, an instruction must have finished being processed. This means that all data that is needed to execute that instruction has arrived at the appropriate execution units and has been processed. Status bits and data complete signals are used to indicate whether all necessary data has arrived and whether an instruction has completed execution.
Checkpointing is performed by the recovery unit (RUnit). In the event that an error is detected, the recovery unit restores the architected register values from the checkpoint registers.
The RAS (Reliability, Availability, and Serviceability) concept, as implemented in hardware and software, is directed to preventing or recognizing system failures (Reliability), the ability to keep the system functioning in the event of failure (Availability), and the ability to fix the failure in a non-disruptive way (Serviceability). RAS may be addressed at various levels of system development: to diagnose design bugs during new product development, in manufacturing to identify bad parts during the system build, and to catch errors while the system is operating. RAS may also be directed to various types of failures, including system-level design oversights, logical errors, hard failures (hardware faults), or soft errors (data errors in memory or after data transfer due to external noise or circuit failure).
In some cases it is only necessary to recognize that an error has occurred. In others it is necessary to diagnose the error, that is, to specifically identify its source. Finally, in some cases it is desirable to remove or correct the error.
A commonly used method of error detection utilizes parity codes. In a one-bit parity code, the number of binary “ones” (1s) in a data word is counted. The word has odd parity if the number of 1s is odd and even parity otherwise. When a word is written into memory, the parity bit is also written. When the word is read out, its parity is recomputed and checked against the stored parity bit. If the stored and recomputed parity bits do not match, an error is detected. A one-bit parity code reliably detects a single bit of error (more generally, any odd number of bit errors) in the data item. If there are two bits of error, the one-bit parity code will not detect any error, since the parity of the data with two errors matches the stored parity.
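The one-bit parity scheme described above can be illustrated with a minimal Python sketch. The function names (parity_bit, write_word, parity_matches) and the 8-bit example word are illustrative assumptions, not taken from any particular memory design.

```python
def parity_bit(word: int) -> int:
    """Return 1 if the word contains an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def write_word(word: int) -> tuple[int, int]:
    """Model a memory write: the word is stored together with its parity bit."""
    return word, parity_bit(word)


def parity_matches(word_read: int, stored_parity: int) -> bool:
    """Model a memory read: recompute parity and compare with the stored bit."""
    return parity_bit(word_read) == stored_parity


stored_word, stored_parity = write_word(0b10110010)  # four 1 bits, parity 0

one_flip = stored_word ^ 0b00000100    # a single bit is corrupted
two_flips = stored_word ^ 0b00010100   # two bits are corrupted

print(parity_matches(stored_word, stored_parity))  # True: no error
print(parity_matches(one_flip, stored_parity))     # False: error detected
print(parity_matches(two_flips, stored_parity))    # True: double error escapes
```

The last line shows the limitation noted above: two flipped bits leave the parity unchanged, so the error goes undetected.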
While a one-bit parity code can detect an error, it cannot identify which particular bit in a data item is in error. Error correcting codes (ECCs), on the other hand, not only detect an error but also allow it to be corrected. Many large memory systems use a single error correcting, double error detecting (SEC-DED) code that allows detection of up to two bits of error and correction of a single bit of error. An error may be detected and corrected in a number of ways, including by system software or by circuitry that is integral to the particular component that generates the error.
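As an illustration of SEC-DED behavior, the following Python sketch uses an extended Hamming code over 4 data bits (8 stored bits in total). The choice of code, the bit layout, and the names encode and decode are assumptions made for this example. Production memory systems typically protect wider words (for example, 64 data bits with 8 check bits), but the principle is the same.

```python
from functools import reduce

DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2, 4 hold Hamming check bits


def xor_all(bits):
    return reduce(lambda a, b: a ^ b, bits)


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = xor_all(bits[1:])  # overall parity enables double-error detection
    return bits


def decode(codeword: list[int]):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    bits = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = xor_all(bits)
    if overall == 1:
        # Odd number of flips: assume one. Syndrome 0 means bit 0 itself flipped.
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        return "uncorrectable", None  # even number of flips: detect only
    else:
        status = "ok"
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, nibble


code = encode(0b1011)
single = list(code); single[5] ^= 1
double = list(code); double[5] ^= 1; double[6] ^= 1

print(decode(code))    # ('ok', 11)
print(decode(single))  # ('corrected', 11)
print(decode(double))  # ('uncorrectable', None)
```

Three or more flipped bits are outside what this code guarantees and may be miscorrected, which is consistent with the single-correct, double-detect limit stated above.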
In traditional commercial designs, EDFI (Error Detection and Fault Isolation) was achieved via special checking logic that was integral to the data flow and control. Error detection refers to the percentage of errors that could reasonably be detected, perhaps by using parity schemes as discussed above. The error detection capability of such systems was high (e.g., 75%-90% of hardware faults detected). Fault isolation refers to the percentage of errors that could be traced to a particular component. The fault isolation of these conventional systems was also good (e.g., about 90% of the errors could be traced to a single component).
While valuable for preserving data integrity, EDFI schemes are not without their disadvantages. In conventional mainframe designs, all logical circuits were checked as the system was operating. In many cases, however, the checking logic (e.g., parity code predictors) was as complex as the circuit components (e.g., adders or shifters) themselves. This results in larger processors with circuits driving longer wires with more fan out, and generally longer cycle times.
The above-described strategy uses the same fundamental concepts as existing zSeries processors, where a Recovery Unit (RUnit) maintains an error correction code (ECC) hardened checkpointed copy of all architected registers, which can be restored in case of an error, with processing resumed from the restored checkpoint. The basic RUnit concept is that working copies of registers are integrated with the execution pipelines, and are updated and available for use as soon as an instruction finishes executing, while a separate copy of all registers is held in an ECC hardened checkpoint array. Updates to the checkpoint array are delayed relative to updates of the working registers to allow for error detection and reporting, where a reported error blocks all further checkpoint updates so that the checkpoint remains intact at a coherent instruction boundary. Register updates are accumulated in a write queue until ready for checkpointing and then drained from the write queue into the checkpoint array.
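The relationship between the working registers, the write queue, and the checkpoint array can be sketched in software as follows. This is a simplified behavioral model under stated assumptions: the class name RecoveryUnitModel and its method names are hypothetical, ECC hardening of the checkpoint array is omitted, and every queued update is treated as ready for checkpointing as soon as drain is called.

```python
from collections import deque


class RecoveryUnitModel:
    """Hypothetical behavioral model of the RUnit concept (no ECC modeled)."""

    def __init__(self, registers: dict):
        self.working = dict(registers)     # copies used by the pipelines
        self.checkpoint = dict(registers)  # last known-good register state
        self.write_queue = deque()         # updates awaiting checkpointing
        self.error_reported = False

    def finish_instruction(self, reg: str, value: int) -> None:
        # The working copy is updated immediately; the checkpoint is delayed.
        self.working[reg] = value
        self.write_queue.append((reg, value))

    def report_error(self) -> None:
        # A reported error blocks all further checkpoint updates.
        self.error_reported = True

    def drain(self) -> None:
        if self.error_reported:
            return
        while self.write_queue:
            reg, value = self.write_queue.popleft()
            self.checkpoint[reg] = value

    def recover(self) -> None:
        # Discard unchecked updates and restore working state from checkpoint.
        self.write_queue.clear()
        self.working = dict(self.checkpoint)
        self.error_reported = False


ru = RecoveryUnitModel({"r1": 0, "r2": 0})
ru.finish_instruction("r1", 5)
ru.drain()                      # r1 = 5 reaches the checkpoint
ru.finish_instruction("r2", 9)
ru.report_error()               # error seen before r2 = 9 is checkpointed
ru.drain()                      # blocked: checkpoint stays intact
ru.recover()
print(ru.working)               # {'r1': 5, 'r2': 0}
```

In this model the update to r2 never reaches the checkpoint, so recovery returns the working registers to the state at the last coherent boundary.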
Instructions are dispatched in groups. In order for the entire group to be ready to checkpoint, all instructions in the group must have completed execution. Some instructions have fixed execution times, while others have variable execution times. Additionally, floating point instructions may complete out of order. To determine if instructions are ready to checkpoint, a compare is performed on the dispatch group's tag (Gtag) versus the next instruction to complete (NTC) instruction tag (Itag). In addition, data arrival at the recovery unit must be considered since the group cannot be checkpointed until all of its data has arrived.
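The readiness test described above can be sketched as a simple predicate. The text does not specify the direction of the tag comparison or how tags wrap, so this sketch makes several labeled assumptions: tags are plain, non-wrapping integers that increase in program order, the Gtag is taken to be the Itag of the youngest instruction in the group, and data arrival is modeled as a count of expected versus received results. The names DispatchGroup and ready_to_checkpoint are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DispatchGroup:
    gtag: int               # assumption: Itag of the youngest instruction in group
    data_expected: int      # assumption: number of results the RUnit must receive
    data_arrived: int = 0   # results received by the recovery unit so far


def ready_to_checkpoint(group: DispatchGroup, ntc_itag: int) -> bool:
    """Return True when the group has completed and all its data has arrived."""
    # Assumption: if the next instruction to complete is younger than every
    # instruction in the group, the whole group has completed execution.
    group_completed = ntc_itag > group.gtag
    all_data_arrived = group.data_arrived >= group.data_expected
    return group_completed and all_data_arrived


group = DispatchGroup(gtag=12, data_expected=3)

print(ready_to_checkpoint(group, ntc_itag=10))  # False: group still executing
group.data_arrived = 2
print(ready_to_checkpoint(group, ntc_itag=13))  # False: data still in flight
group.data_arrived = 3
print(ready_to_checkpoint(group, ntc_itag=13))  # True: both conditions hold
```

Real hardware tags are finite-width and wrap around, so an actual comparator would need wrap-aware logic that this sketch does not attempt.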
The checkpointing of groups must keep pace with the dispatching of groups; otherwise, the performance of the processor will be degraded. In order to minimize any backward pressure on the dispatch and execution pipelines and on store queue release to the L2 cache, the checkpointing of groups needs to occur as quickly as possible once all necessary conditions are satisfied and reported to the recovery unit.
Thus, it would be advantageous to provide a method and apparatus for reducing the number of cycles required to checkpoint instructions in a multi-threaded microprocessor.