1. Field of the Invention
This invention relates to a new method to address soft error rates without degrading cycle time and without adding significant design complexity, power consumption or area (thus controlling design, debug and manufacturing costs).
2. Description of Background
The semiconductor industry relies on aggressive scaling of device sizes to deliver continuing cost reductions of semiconductor products such as microprocessors. CMOS scaling is based upon a technique first described by Dennard et al. [JSSC 1974]. A component of CMOS scaling is the lowering of supply and threshold voltage, making circuits more susceptible to soft errors.
In the past, a wide range of products could ignore the impact of soft errors because of their low occurrence rate. As supply and threshold voltages continue to be scaled down, these products must increasingly address soft errors to provide acceptable failure rates as expressed by MTBF (“mean time between failures”). Thus, whereas previously only high-end reliable servers such as IBM Mainframes in the System Z family provided robust soft error protection, lower-end high-volume products must now start to address such issues.
However, while high-end servers can provide robust soft error resilience by adding features such as recovery units, high-volume parts must achieve soft error resilience using lower-cost options. An exemplary description of such robust soft error resilience in high-end servers follows.
Referring now to Prior Art FIG. 1, a state of the art microprocessor is depicted. A state of the art microprocessor typically includes a high-bandwidth instruction fetch front end; a highly accurate dynamic branch predictor; instruction decode and dispatch logic operating on a plurality of instructions simultaneously; several issue queues corresponding to several execution pipelines; several register files providing operands for the several execution pipelines; and in-order completion logic.
Referring now to Prior Art FIG. 2, there is illustrated a common technique of duplicating a register file to increase the number of read ports. In accordance with this implementation, a single architectural register file (containing renames or not) is implemented using multiple copies. Each copy receives the results from all execution pipelines and writes them to the corresponding target registers, and provides operand read ports for a subset of the execution pipelines, thereby providing a larger aggregate number of read ports than could otherwise be provided.
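By way of a non-limiting illustration, the duplication scheme described above might be modeled in software roughly as follows. This is a minimal sketch only: the class name, the register count, and the number of read ports per copy are assumptions made for illustration and are not taken from FIG. 2.

```python
class DuplicatedRegisterFile:
    """Toy model of one architectural register file kept as identical copies.

    All names and sizes here are hypothetical, chosen only for illustration.
    """

    def __init__(self, num_regs=32, num_copies=2, read_ports_per_copy=2):
        self.copies = [[0] * num_regs for _ in range(num_copies)]
        self.read_ports_per_copy = read_ports_per_copy

    def write(self, reg, value):
        # Every copy receives every result, so all copies hold the same data.
        for regs in self.copies:
            regs[reg] = value

    def read(self, copy_id, reg):
        # A pipeline reads only from the copy that serves its subset.
        return self.copies[copy_id][reg]

    @property
    def total_read_ports(self):
        # Aggregate read ports grow with the number of copies.
        return len(self.copies) * self.read_ports_per_copy


rf = DuplicatedRegisterFile()
rf.write(5, 42)
```

In this sketch a write reaches both copies in the same step. A hardware design may instead update a remote copy with additional delay.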
In addition to providing additional read ports, register file duplication also alleviates congestion and wire delay, by providing multiple physical locations for reading data values. In one use, duplicated register files also bridge a latency gap between execution pipelines, by allowing for extra delay to write back results in remote register files, thereby allowing an implementation to cope with wire delays common in today's complex high-frequency designs.
In accordance with one potential mode, instruction decode logic can indicate that an instruction should be dispatched to a specific cluster (cluster 1 and cluster 2 corresponding to the execution pipelines associated with the first and second register file copy, respectively). This decode-based steering is advantageous in a clustered microarchitecture with variable update delays, in which writing a computation result to another cluster's register file copy takes longer than writing it to the local copy. Thus, decode can steer dependent operations to the same cluster and reduce the impact of wire delay on execution schedules. In another mode dealing with clustered microarchitectures, some operations (e.g., a divide, or some control registers) may only be provided in one cluster but not another cluster. Decode can steer these operations using said steering indication. In one implementation we refer to this as the cluster steering indicator.
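The two steering modes described above can be sketched, under stated assumptions, as a simple decode-time policy. The instruction tuple format, the `CLUSTER_ONLY_OPS` table, and the round-robin placement of independent instructions are all hypothetical choices made for this example. The text above specifies only that dependent operations may be steered to the same cluster and that operations available in a single cluster must be steered there.

```python
# Hypothetical: assume the divide unit exists only in cluster 0.
CLUSTER_ONLY_OPS = {"div": 0}


def decode(instructions, num_clusters=2):
    """Return one cluster steering indicator per (op, dest, sources) tuple."""
    producer_cluster = {}  # destination register -> cluster that produces it
    indicators = []
    next_default = 0       # round-robin choice for independent instructions
    for op, dest, sources in instructions:
        if op in CLUSTER_ONLY_OPS:
            # The operation is provided in only one cluster.
            cluster = CLUSTER_ONLY_OPS[op]
        else:
            local = [producer_cluster[s] for s in sources if s in producer_cluster]
            if local:
                # Follow the producer of the first known source operand.
                cluster = local[0]
            else:
                cluster = next_default
                next_default = (next_default + 1) % num_clusters
        producer_cluster[dest] = cluster
        indicators.append(cluster)
    return indicators


program = [
    ("add", "r1", ["r8", "r9"]),    # independent
    ("add", "r2", ["r10", "r11"]),  # independent
    ("sub", "r3", ["r2", "r8"]),    # depends on r2
    ("div", "r4", ["r3", "r1"]),    # divide is cluster-restricted
]
print(decode(program))  # [0, 1, 1, 0]
```

The sketch numbers clusters from 0, whereas the text refers to cluster 1 and cluster 2.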
Referring now to Prior Art FIG. 3, there is shown an exemplary state of the art recovery mechanism as used in highly reliable computer systems. A Buffer Control Element (BCE) 310 provides a common interface to the cache hierarchy (indicated as L1 cache 315). Two copies of a computational core 320 and 330 (indicated as I-Unit for instruction decode and dispatch units and E-unit for instruction execution unit) independently process the same instruction stream provided by the BCE to both copies of the computational core. Outputs of the duplicated computational core units are compared (indicated by box labeled “=” 340), and retired in the R-unit 350, and/or used by the Buffer Control Element to initiate memory subsystem requests.
According to this architecture, the R-unit provides a highly protected reference copy of the entire microprocessor state, and can be used to re-initiate execution, when a fault has been discovered, by loading the state into the register files of both cores.
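The duplicate, compare, and restore behavior described above can be illustrated with a small software model. This is a sketch under simplifying assumptions and not a description of the actual hardware: the `RUnit` class, the two-operation instruction set, the per-instruction checkpointing, and the `fault_at` fault-injection parameter are all hypothetical.

```python
import copy


class RUnit:
    """Hypothetical stand-in for the protected reference copy of the state."""

    def __init__(self, initial_state):
        self.checkpoint = copy.deepcopy(initial_state)

    def retire(self, state):
        self.checkpoint = copy.deepcopy(state)

    def restore(self):
        return copy.deepcopy(self.checkpoint)


def execute(state, instr):
    """Execute one (op, dest, a, b) instruction and return the new state."""
    op, dest, a, b = instr
    new_state = dict(state)
    new_state[dest] = state[a] + state[b] if op == "add" else state[a] - state[b]
    return new_state


def run_lockstep(program, initial_state, fault_at=None):
    """Run two core copies in lockstep; restore both copies on a mismatch."""
    r_unit = RUnit(initial_state)
    core_a, core_b = r_unit.restore(), r_unit.restore()
    recoveries, pc = 0, 0
    while pc < len(program):
        next_a = execute(core_a, program[pc])
        next_b = execute(core_b, program[pc])
        if fault_at == pc:
            next_b[program[pc][1]] ^= 1  # inject a one-time single-bit flip
            fault_at = None
        if next_a == next_b:
            # Results agree: retire them into the reference copy.
            r_unit.retire(next_a)
            core_a, core_b = next_a, next_b
            pc += 1
        else:
            # Results differ: reload both cores and retry the instruction.
            core_a, core_b = r_unit.restore(), r_unit.restore()
            recoveries += 1
    return r_unit.checkpoint, recoveries


program = [("add", "r2", "r0", "r1"), ("sub", "r3", "r2", "r0")]
state = {"r0": 3, "r1": 4, "r2": 0, "r3": 0}
print(run_lockstep(program, state, fault_at=1))
```

With the injected fault, the model retries the second instruction once and ends in the same state as a fault-free run.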
According to other implementations, alternate designs are provided, such as using multiple executions in a shared data path to provide correctness determination, or by protecting computation results with parity or ECC protection. Depending on implementation details, arithmetic and logic computation elements can generate results including parity or ECC indication to further protect the computed data.
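As a simple illustration of parity protection of a computation result, the following sketch attaches an even-parity bit to a value and checks it on readback. The function names are hypothetical. Note that a single parity bit detects any odd number of flipped bits but cannot locate or correct them; correction requires an ECC code, which is not shown here.

```python
def parity(value: int) -> int:
    """Return 1 if the non-negative value has an odd number of set bits."""
    return bin(value).count("1") & 1


def protect(value: int):
    """Pair a value with its parity bit, as a result bus or register might."""
    return value, parity(value)


def check(value: int, stored_parity: int) -> bool:
    """Return True if the value still matches its stored parity bit."""
    return parity(value) == stored_parity


word, p = protect(0b1011_0010)
corrupted = word ^ (1 << 3)   # model a soft error flipping bit 3
print(check(word, p))         # True
print(check(corrupted, p))    # False
```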
According to the described prior-art embodiments, a full copy of the entire state is to be maintained in the R-unit to provide a sound restart point when errors are detected using the described or any other known or unknown error detection mechanisms.
In accordance with these mechanisms, when an error is detected, recovery is performed in accordance with Prior Art FIG. 4. The method 400 commences when an error condition is detected in step 410. Recovery logic of the R-unit 350 inhibits further execution in step 420. The R-unit flushes all pipelines and other associated state in step 430. Modified memory data corresponding to committed known good state is retired to the memory subsystem.
After prior state has been purged from the microprocessor, the R-unit initiates a recovery sequence and control is passed to special purpose recovery logic in step 440. In step 450, in accordance with embodiments of R-unit based recovery methods, dedicated data paths (either integrated in preexisting scan test logic, or otherwise integrated in the design) allow R-unit recovery logic to write and update each and every architected state bit in the microprocessor. In step 460, the state update has completed, and the microprocessor restarts execution from the recovered state.
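The sequence of method 400, from inhibiting execution in step 420 through restart in step 460, may be summarized in the following sketch. It is only a software analogy: the `Core` fields, the `store_queue` used to model committed but not yet written memory data, and the returned trace of step numbers are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical, highly simplified view of the microprocessor state."""
    registers: dict
    pipeline: list = field(default_factory=list)     # in-flight work
    store_queue: list = field(default_factory=list)  # committed (addr, value)
    running: bool = True


def recover(core, checkpoint, memory):
    """Apply steps 420 to 460 after an error has been detected (step 410)."""
    trace = []
    core.running = False                 # step 420: inhibit further execution
    trace.append(420)
    core.pipeline.clear()                # step 430: flush pipelines
    for addr, value in core.store_queue:
        memory[addr] = value             # retire committed known good data
    core.store_queue.clear()
    trace.append(430)
    trace.append(440)                    # step 440: recovery logic in control
    core.registers = dict(checkpoint)    # step 450: rewrite architected state
    trace.append(450)
    core.running = True                  # step 460: restart execution
    trace.append(460)
    return trace


core = Core(registers={"r1": 99}, pipeline=["add r1"], store_queue=[(0x10, 7)])
memory = {}
print(recover(core, {"r1": 5}, memory))  # [420, 430, 440, 450, 460]
```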
Thus, as is evident from the foregoing description, while the R-unit provides superior fault tolerance by providing means for checking correctness and for recovering when incorrect execution is determined, the costs are significant. These costs are due to the increased area for duplicating the computational core; storing a copy of the architected state distinct and separate from the execution paths; providing special purpose control and recovery paths based on a mode of operation that allows R-unit control; and providing special data paths to transfer data to the R-unit under normal execution and to write and update every architected state bit during the recovery sequence.
To continue delivering cost reductions by continuing to shrink device sizes in new technologies, what is needed in the art is a new method to address soft error rates. What is further needed in the art are methods and apparatus to provide such resilience without adding significant design complexity, power consumption or area (thus controlling design, debug and manufacturing costs).