1. Technical Field
The present invention relates to an improved data processing system. More specifically, the present invention is directed to a method, apparatus, and computer program product for recovering from transient errors in arrays and latches within, and supporting, a microprocessor. Recovery restores registers to a known correct state previously checkpointed for the processor and, for certain errors, directs processing to a service processor.
2. Description of Related Art
A symmetric multiprocessing (SMP) data processing system has multiple processors that are symmetric such that each processor has the same processing speed and latency. An SMP system may be logically partitioned to have one or more operating systems that divide the work into tasks that are distributed evenly among the various processors by dispatching programs to each processor.
Modern microprocessors are usually superscalar, which means a single processor can decode, dispatch, and execute multiple instructions each processor cycle. These processors may also support simultaneous multi-threading (SMT), which means each processor can execute more than one software program (thread) concurrently. An SMT processor typically has the ability to favor one thread over another when both threads are running on the same processor. Each thread is assigned a hardware-level priority by the operating system, or by the hypervisor in a logically partitioned environment. The hypervisor may assist error correction by providing special handling to a microprocessor that has issued a machine check signal or a hypervisor interrupt.
Static Random Access Memories (SRAMs) have been susceptible to transient errors caused by naturally occurring radiation for several generations of integrated circuits. As gates of various kinds have been scaled down, even non-SRAM circuits, e.g., latches, have become susceptible to this problem. The phenomenon must be handled if architectures of further reduced size are to remain useful and produce correct results when delivered to a customer in a processing device.
A further potential problem is extremely rare sequences and combinations of instructions and states that produce incorrect results every time they occur. Typically, such so-called ‘functional errors’ or ‘bugs’ would be discovered through intensive testing of a design prior to general availability. With extremely complex, superscalar, multi-threaded processors, used in incrementally scalable large SMPs with large numbers of virtual partitions, the verification state space is effectively unbounded. Validation of such a large state space often exceeds the capacity of formal verification tools and simulation test cases. Prototype hardware is typically manufactured for intensive testing at machine speeds, but some mis-handled combinations of rare events may occur so infrequently that they are encountered very late, or not at all, during prototype testing. Modifying and manufacturing additional prototypes to fix late-found design bugs is expensive and time-consuming, and may delay a product from reaching the market.
Often such design errors could be avoided by reducing the number and complexity of operations taking place in the processor, thereby dramatically reducing the total state space and making the mis-handled combination of events rarer, or even impossible. Avoiding complex superscalar pipelining techniques such as multiple instruction decode, dispatch, and execution; load and branch look-aheads; imprecise exception mode; pre-fetching; out-of-order processing; and simultaneous multi-threading (SMT) would reduce the total possible state space of a processor to a level where simulation tools would be adequate to ensure correct operation. However, modern processor throughput demands are such that dropping these techniques entirely would result in a commercially unviable processor. It would nevertheless be advantageous to suspend or disable such complex controls temporarily, only when required to avoid a mis-handled combination of rare events. The prior art does not teach forgoing superscalar pipelining techniques and other modes (now considered normal) so that a sequence of instructions that encountered erroneous operation can be retried successfully, avoiding the combination of rare events that caused the erroneous operation.
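The idea of retrying a failed instruction sequence with the complex controls temporarily disabled can be illustrated with a toy software model. This is only a sketch under stated assumptions: the names `run_with_retry` and `toy_execute`, the `simplified` flag, and the `"rare"` pseudo-instruction are hypothetical and stand in for hardware behavior that the text describes only in general terms.

```python
def run_with_retry(execute, checkpoint, instructions):
    """Run in full-performance mode; on error, restore and retry simplified.

    Hypothetical model: 'checkpoint' is a dict of register values.
    """
    state = dict(checkpoint)  # work on a copy so the checkpoint stays intact
    try:
        return execute(state, instructions, simplified=False), "full"
    except RuntimeError:
        state = dict(checkpoint)  # discard partial results from the failed run
        return execute(state, instructions, simplified=True), "simplified"


def toy_execute(state, instructions, simplified):
    """Toy executor: the 'rare' op fails only when complex modes are enabled."""
    for op, value in instructions:
        if op == "add":
            state["acc"] += value
        elif op == "rare" and not simplified:
            raise RuntimeError("mis-handled combination of rare events")
    return state["acc"]


program = [("add", 2), ("rare", 0), ("add", 3)]
result, mode = run_with_retry(toy_execute, {"acc": 0}, program)
```

In this model the first attempt fails partway through, the partial result is thrown away, and the retry completes from the checkpointed state in the simplified mode.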
Increasing circuit density with new technologies is causing power consumption to become a limiting factor in microprocessor designs. In order to minimize power consumption, portions of the circuitry that are not required for a particular active operation are “turned off” by suppressing the clocks to them. Suppressing the clocks results in less circuit switching, and hence less power consumption. During periods of very low workload, large portions of the processor may be put into a low-power state, sometimes referred to as “nap” or “doze” modes. In the event of an error, where a prior checkpoint state is restored to the processor, the logic that is in the low-power state must be woken so that it can also be reset and refreshed to the prior checkpoint state. Such management of low-power states during processor recovery is not addressed in the prior art.
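The ordering constraint described above (wake clock-gated logic first, then refresh it from the checkpoint) can be shown with a small model. It is a minimal sketch only: the `Unit` class, its `clocked` flag, and the `recover` function are hypothetical names chosen for illustration, not a description of any actual hardware interface.

```python
class Unit:
    """Hypothetical functional unit whose clock may be suppressed ("nap")."""

    def __init__(self, name, state, napping=False):
        self.name = name
        self.state = state
        self.clocked = not napping

    def restore(self, value):
        # A clock-gated unit cannot latch new state in this model.
        if not self.clocked:
            raise RuntimeError(f"{self.name} is in a low-power state")
        self.state = value


def recover(units, checkpoint):
    """Wake every napping unit, then refresh all units from the checkpoint."""
    woken = []
    for unit in units:
        if not unit.clocked:
            unit.clocked = True  # re-enable the clock before restoring
            woken.append(unit.name)
    for unit in units:
        unit.restore(checkpoint[unit.name])
    return woken


units = [Unit("fpu", state=7, napping=True), Unit("lsu", state=9)]
woken = recover(units, {"fpu": 1, "lsu": 2})
```

Skipping the wake step in this model would leave the napping unit holding stale state, which is the situation the paragraph above says must be avoided.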
Virtualization of processors in large SMP systems requires efficient (fast) address translation to maintain throughput. A common technique for improving address translation performance is the use of “look-aside” buffers, which remember results from prior translations so they can simply be reused instead of recalculated. A look-aside buffer contains a relatively small number of entries, so over time older entries must be discarded to make room for newer ones. If the result for a translation is not available in a look-aside buffer, it must be recalculated through a series of memory accesses and additions. Once the first pointer to memory is known, hardware state machines can traverse a linked list of address pointers to perform the translation. However, the first address pointer, which points to a storage “segment”, cannot be determined by the hardware state machines. Segment pointers are managed by the operating system and hypervisor, and are stored in a Segment Lookaside Buffer (SLB) in the processor.
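The general behavior of a look-aside buffer (reuse a prior result on a hit, recompute and possibly discard an entry on a miss) can be sketched as follows. The class name `LookasideBuffer`, the least-recently-used discard policy, and the `toy_walk` stand-in for the slow recalculation are all assumptions made for illustration; the text above does not specify a replacement policy.

```python
from collections import OrderedDict


class LookasideBuffer:
    """Small translation cache; capacity and LRU discard are assumptions."""

    def __init__(self, capacity, walk):
        self.capacity = capacity
        self.walk = walk  # slow path that recalculates a translation
        self.entries = OrderedDict()
        self.misses = 0

    def translate(self, virtual_page):
        if virtual_page in self.entries:
            # Hit: reuse the remembered result and mark it recently used.
            self.entries.move_to_end(virtual_page)
            return self.entries[virtual_page]
        # Miss: recalculate, discarding the oldest entry if the buffer is full.
        self.misses += 1
        physical_page = self.walk(virtual_page)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[virtual_page] = physical_page
        return physical_page


def toy_walk(virtual_page):
    """Placeholder for the series of memory accesses and additions."""
    return virtual_page + 0x1000


buf = LookasideBuffer(capacity=2, walk=toy_walk)
results = [buf.translate(page) for page in (1, 2, 1, 3, 2)]
```

With a capacity of two, the access pattern above shows both effects: the repeated lookup of page 1 is served from the buffer, while page 2 is discarded to make room for page 3 and must be recalculated when it is requested again.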
Unfortunately, the size of the SLB is such that it is prohibitively expensive to provide a backed-up copy of it within the processor chip die. Thus, in the event of any failure, a means is needed to determine whether the SLB contents may have been corrupted and, if so, to obtain and synchronize backed-up data. No such means is found in the prior art.