1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to a method of operating a recovery unit in a microprocessor that uses checkpoint registers.
2. Description of the Related Art
High-performance computer systems use multiple processors to carry out the various program instructions embodied in computer programs such as software applications and operating systems. A typical multi-processor system 10 is shown in FIG. 1. Computer system 10 has one or more processing units arranged in one or more processor groups; in the depicted system, there are four processing units 12a, 12b, 12c and 12d in processor group 14. The processing units communicate with other components of system 10 via a system or fabric bus 16. Fabric bus 16 is connected to a system memory 20, and various peripheral devices 22. Service processors 18a, 18b are connected to processing units 12 via a JTAG interface or other external service port. A processor bridge 24 can optionally be used to interconnect additional processor groups. System 10 may also include firmware (not shown) which stores the system's basic input/output logic, and seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).
System memory 20 (random access memory or RAM) stores program instructions and operand data used by the processing units, in a volatile (temporary) state. Peripherals 22 may be connected to fabric bus 16 via, e.g., a peripheral component interconnect (PCI) local bus using a PCI host bridge. A PCI bridge provides a low latency path through which processing units 12a, 12b, 12c and 12d may access PCI devices mapped anywhere within bus memory or I/O address spaces. The PCI host bridge interconnecting peripherals 22 also provides a high bandwidth path to allow the PCI devices to access RAM 20. Such PCI devices may include a network adapter, a small computer system interface (SCSI) adapter providing interconnection to a permanent storage device (i.e., a hard disk), and an expansion bus bridge such as an industry standard architecture (ISA) expansion bus for connection to input/output (I/O) devices including a keyboard, a graphics adapter connected to a display device, and a graphical pointing device (mouse) for use with the display device.
In a symmetric multi-processor (SMP) computer, all of the processing units 12a, 12b, 12c and 12d are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. As shown with processing unit 12a, each processing unit may include one or more processor cores 26a, 26b which carry out program instructions in order to operate the computer. An exemplary processor core includes the Power5™ processor marketed by International Business Machines Corp., which comprises a single integrated circuit superscalar microprocessor having various execution units (fixed-point units, floating-point units, and load/store units), registers, buffers, memories, and other functional units, which are all formed by integrated circuitry. The processor cores may operate according to reduced instruction set computing (RISC) techniques, and may employ both pipelining and out-of-order execution of instructions to further improve the performance of the superscalar architecture.
Each processor core 26a, 26b may include an on-board (L1) cache (typically separate instruction and data caches) implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory 20. A processing unit can include another cache, i.e., a second level (L2) cache 28 which, along with a memory controller 30, supports both of the L1 caches that are respectively part of cores 26a and 26b. Additional cache levels may be provided, such as an L3 cache 32 which is accessible via fabric bus 16.
Each cache level, from highest (L1) to lowest (L3), can successively store more information, but at a longer access penalty. For example, the on-board L1 caches in the processor cores might have a storage capacity of 128 kilobytes of memory, L2 cache 28 might have a storage capacity of 512 kilobytes, and L3 cache 32 might have a storage capacity of 2 megabytes. To facilitate repair/replacement of defective processing unit components, each processing unit 12a, 12b, 12c, 12d may be constructed in the form of a replaceable circuit board or similar field replaceable unit (FRU), which can be easily installed in or swapped out of system 10 in a modular fashion.
In a superscalar architecture, instructions may be completed in-order or out-of-order. In-order completion means no instruction can complete before all instructions dispatched ahead of it have been completed. Out-of-order completion means that an instruction is allowed to complete before all instructions ahead of it have been completed, as long as predefined rules are satisfied. Within a pipelined superscalar processor, instructions are first fetched, decoded and then buffered. Instructions can be dispatched to execution units as resources and operands become available. Additionally, instructions can be fetched and dispatched speculatively based on predictions about branches taken. The result is a pool of instructions in varying stages of execution, none of which have completed by writing final results to the system memory hierarchy. As resources become available and branches are resolved, the instructions are retired in program order, thus preserving the appearance of a machine that executes the instructions in program order. Overall instruction throughput can be further improved by modifying the hardware within the processor, for example, by having multiple execution units in a single processor core.
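The retirement behavior described above can be illustrated with a minimal software sketch. It is not a model of any particular processor: the `Instr` record, its `tag` and `done` fields, and the `retire_in_order` helper are hypothetical names introduced only for this illustration, and the window of in-flight instructions is assumed to be held oldest-first.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    tag: int            # program-order position (hypothetical field)
    done: bool = False  # set once the instruction has finished executing


def retire_in_order(window):
    """Retire finished instructions from the head of the window, oldest first.

    Stops at the first unfinished instruction, so an instruction that
    finished early is never retired ahead of an older, pending one.
    """
    retired = []
    while window and window[0].done:
        retired.append(window.pop(0).tag)
    return retired


window = [Instr(0), Instr(1), Instr(2), Instr(3)]

window[2].done = True             # instruction 2 finishes first (out of order)
first = retire_in_order(window)   # nothing retires: instruction 0 is pending

window[0].done = True
window[1].done = True
second = retire_in_order(window)  # 0, 1 and 2 retire together, in program order
```

In this sketch execution finishes out of order, yet the retired sequence is always a prefix of program order, which is the property that preserves the appearance of sequential execution.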
Computer system 10 may also use a computing technique known as hardware multithreading to independently execute smaller sequences of instructions called threads or contexts. When a processor, for any of a number of reasons, stalls and cannot continue processing or executing one of these threads, the processor can switch to another thread. The term “multithreading” as used by those skilled in the art of computer processor architecture is not the same as the software use of the term in which one task is subdivided into multiple related threads. Software multithreading substantially involves the operating system which manipulates and saves data from registers to main memory and maintains the program order of related and dependent instructions before a thread switch can occur. Software multithreading does not require nor is it concerned with hardware multithreading and vice versa. Hardware multithreading manipulates hardware-architected registers, execution units and pipelined processors to maintain the state of one or more independently executing sets of instructions (threads) in the processor hardware. Hardware threads could be derived from, for example, different tasks in a multitasking system, different threads compiled from a software multithreading system, or from different I/O processors. In each example of hardware multithreading, more than one thread can be independently maintained in a processor's registers.
At any time during operation of computer system 10, a bit in a given memory cell or processor register may contain an incorrect value due to a soft error caused by, e.g., stray radiation or electrostatic discharge. Different techniques are used to detect and correct such errors. Processor cores 26 include a recovery unit that saves a copy of the data of the architected registers in checkpoint arrays once instructions have passed the completion point. In the event that an error is detected in the data output, the architected register values are restored from the checkpoint.
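The save-and-restore role of the recovery unit may be sketched in software as follows. This is only an illustrative model under stated assumptions: the `RecoveryUnit` class and its `save` and `restore` methods are hypothetical names, the register file is represented as a plain list of integers, and error detection itself is not modeled (the bit flip is injected by hand).

```python
class RecoveryUnit:
    """Toy model: keeps one checkpointed copy of the architected registers."""

    def __init__(self, num_regs):
        self.checkpoint = [0] * num_regs

    def save(self, regs):
        # Called once the instructions that produced these values have
        # passed the completion point.
        self.checkpoint = list(regs)

    def restore(self):
        # Called when an error is detected; returns the last known-good state.
        return list(self.checkpoint)


regs = [0] * 4
ru = RecoveryUnit(num_regs=4)

regs[1] = 42
ru.save(regs)          # checkpoint the completed state

regs[1] ^= 1 << 3      # simulated soft error: bit 3 of register 1 flips
corrupted = regs[1]    # 42 XOR 8 = 34

regs = ru.restore()    # roll back to the checkpointed values
```

After the restore, register 1 again holds the checkpointed value, and execution could resume from the checkpointed state.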
Processor cores 26 dispatch program instructions in groups. In order for an entire group of instructions to be ready to checkpoint, all instructions in the group must have completed execution. Some instructions have fixed execution times, while others have variable execution times. Additionally, some instructions may complete out of order. To determine if instructions are ready to checkpoint, an instruction group identifier or tag is defined as one plus the instruction tag of the youngest instruction in the group (i.e., the group tag for the current group is the same as the instruction tag of the oldest instruction in the next group). A comparison may then be performed between the group tag of the currently dispatching instruction group and the tag for the next instruction to complete. If the current group tag is the same as the tag of the next instruction to complete, then all of the instructions in the current group have safely completed and the results from the registers can be checkpointed.
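The tag comparison described above can be expressed as a short sketch. The function names `group_tag` and `ready_to_checkpoint` are hypothetical, and the sketch assumes that instruction tags are integers that increase in program order without wrapping, so that the youngest instruction in a group carries the largest tag; a hardware tag of finite width would wrap around and need additional handling.

```python
def group_tag(instr_tags):
    """Group tag: one plus the tag of the youngest instruction in the group.

    Assumes tags increase monotonically in program order (no wraparound),
    so the youngest instruction has the largest tag.
    """
    return max(instr_tags) + 1


def ready_to_checkpoint(gtag, next_to_complete_tag):
    """True when every instruction in the group has completed.

    The next instruction to complete is then the oldest instruction of the
    following group, whose tag equals the current group's tag.
    """
    return gtag == next_to_complete_tag


current_group = [8, 9, 10, 11]      # instruction tags in one dispatch group
gtag = group_tag(current_group)     # 11 + 1 = 12

still_busy = ready_to_checkpoint(gtag, next_to_complete_tag=10)
all_done = ready_to_checkpoint(gtag, next_to_complete_tag=12)
```

While instruction 10 is still the next to complete, the group cannot be checkpointed; once the next-to-complete tag reaches 12, all four instructions of the group have completed and the comparison succeeds.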
Data arrival at the recovery unit can present difficulties since the instruction group cannot be checkpointed until the data from all of the instructions have arrived, but the checkpointing of groups must keep pace with the dispatching of groups or else the performance of the processor can be severely degraded. In a pipelined microprocessor with a low number of gate delays per stage, the comparison between the group tag and the next-to-complete tag requires a majority of the cycle time, with extra status checks taking much of the remainder. This comparison result must be used to manage the data queues that hold the result data, and the decision must also be fed to the circuits that control the read and write pointers as well as other control circuits that are distributed throughout the recovery unit. However, circuit latencies leave insufficient time to perform a hold or advance the controls of the read pointers of the recovery unit queue and can delay the deallocation/release of resources within the recovery unit. It would, therefore, be desirable to devise an improved method of checkpointing data in a recovery unit which could reduce or minimize any backward pressure on the dispatch and execution pipelines and store queue release to L2 cache by allowing checkpointing of an instruction group to quickly occur once all necessary conditions are satisfied and reported to the recovery unit.
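The dependence of the queue controls on the comparison result, and the resulting backward pressure, may be illustrated with the following sketch. It is a simplified software analogy only, not the circuit of any actual recovery unit: the `ResultQueue` class, its `push` and `step` methods, and the two-entry depth are assumptions made for this example, and one call to `step` stands in for one machine cycle.

```python
class ResultQueue:
    """Toy circular queue of (group tag, result data) entries."""

    def __init__(self, depth):
        self.slots = [None] * depth
        self.read_ptr = 0
        self.write_ptr = 0
        self.count = 0

    def push(self, gtag, data):
        """Write one group's results; returns False when the queue is full."""
        if self.count == len(self.slots):
            return False  # full: the sender must stall (backward pressure)
        self.slots[self.write_ptr] = (gtag, data)
        self.write_ptr = (self.write_ptr + 1) % len(self.slots)
        self.count += 1
        return True

    def step(self, next_to_complete_tag):
        """One cycle: advance the read pointer on a tag match, else hold."""
        if self.count == 0:
            return None
        gtag, data = self.slots[self.read_ptr]
        if gtag != next_to_complete_tag:
            return None  # hold: the group at the head is not yet complete
        self.read_ptr = (self.read_ptr + 1) % len(self.slots)
        self.count -= 1
        return data      # this group's results can be checkpointed


q = ResultQueue(depth=2)
q.push(4, "A")
q.push(8, "B")
stalled = q.push(12, "C")     # queue is full, so the push is refused

held = q.step(2)              # head group (tag 4) not complete: hold
released = q.step(4)          # tags match: advance and release the entry
accepted = q.push(12, "C")    # the freed entry lets the stalled push proceed
```

The sketch shows why a late hold-or-advance decision matters: entries are released only when the read pointer advances, so any delay in that decision keeps the queue full longer and stalls the producers feeding it.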