1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to an improved method of handling program instructions in a processor and to an improved processor design.
2. Description of the Related Art
High-performance computer systems use multiple processors to carry out the various program instructions embodied in computer programs such as software applications and operating systems. A typical multi-processor system 10 is shown in FIG. 1. Computer system 10 has one or more processing units arranged in one or more processor groups; in the depicted system, there are four processing units 12a, 12b, 12c and 12d in processor group 14. The processing units communicate with other components of system 10 via a system or fabric bus 16. Fabric bus 16 is connected to a system memory 20, and various peripheral devices 22. Service processors 18a, 18b are connected to processing units 12 via a JTAG interface or other external service port. A processor bridge 24 can optionally be used to interconnect additional processor groups. System 10 may also include firmware (not shown) which stores the system's basic input/output logic, and seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).
System memory 20 (random access memory or RAM) stores program instructions and operand data used by the processing units, in a volatile (temporary) state. Peripherals 22 may be connected to fabric bus 16 via, e.g., a peripheral component interconnect (PCI) local bus using a PCI host bridge. The PCI host bridge provides a low latency path through which processing units 12a, 12b, 12c and 12d may access PCI devices mapped anywhere within bus memory or I/O address spaces. The PCI host bridge also provides a high bandwidth path to allow the PCI devices to access RAM 20. Such PCI devices may include a network adapter, a small computer system interface (SCSI) adapter providing interconnection to a permanent storage device (i.e., a hard disk), and an expansion bus bridge such as an industry standard architecture (ISA) expansion bus for connection to input/output (I/O) devices including a keyboard, a graphics adapter connected to a display device, and a graphical pointing device (mouse) for use with the display device.
In a symmetric multi-processor (SMP) computer, all of the processing units 12a, 12b, 12c and 12d are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. As shown with processing unit 12a, each processing unit may include one or more processor cores 26a, 26b which carry out program instructions in order to operate the computer. An exemplary processor core includes the Power5™ processor marketed by International Business Machines Corp., which comprises a single integrated circuit superscalar microprocessor having various execution units (fixed-point units, floating-point units, and load/store units), registers, buffers, memories, and other functional units, which are all formed by integrated circuitry. The processor cores may operate according to reduced instruction set computing (RISC) techniques, and may employ both pipelining and out-of-order execution of instructions to further improve the performance of the superscalar architecture.
Each processor core 26a, 26b may include an on-board (L1) cache (typically separate instruction and data caches) implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory 20. A processing unit can include another cache, i.e., a second level (L2) cache 28 which, along with a memory controller 30, supports both of the L1 caches that are respectively part of cores 26a and 26b. Additional cache levels may be provided, such as an L3 cache 32 which is accessible via fabric bus 16. Each cache level, from highest (L1) to lowest (L3), can successively store more information, but at a longer access penalty. For example, the on-board L1 caches in the processor cores might have a storage capacity of 128 kilobytes of memory, L2 cache 28 might have a storage capacity of 512 kilobytes, and L3 cache 32 might have a storage capacity of 2 megabytes. To facilitate repair/replacement of defective processing unit components, each processing unit 12a, 12b, 12c, 12d may be constructed in the form of a replaceable circuit board or similar field replaceable unit (FRU), which can be easily installed in or swapped out of system 10 in a modular fashion.
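By way of illustration only, the trade-off between cache level and access penalty can be sketched as a lookup that walks the hierarchy from L1 down to system memory. The following Python sketch is a simplified model and not a description of any particular hardware: the class name `CacheLevel`, the function `access`, and all latency figures are hypothetical, and cache capacity and eviction are deliberately not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class CacheLevel:
    name: str
    latency_cycles: int  # hypothetical access penalty, not taken from the text
    lines: set = field(default_factory=set)  # addresses currently held


def access(levels, memory_latency, address):
    """Return (cycles spent, name of the level that supplied the value).

    Levels are probed in order (L1 first). On a hit, the value is also
    copied into every faster level that missed. On a miss in all levels,
    the value comes from memory and is copied into every level.
    """
    cycles = 0
    for i, level in enumerate(levels):
        cycles += level.latency_cycles
        if address in level.lines:
            for faster in levels[:i]:
                faster.lines.add(address)
            return cycles, level.name
    cycles += memory_latency
    for level in levels:
        level.lines.add(address)
    return cycles, "memory"


def make_hierarchy():
    # Assumed latencies only; slower levels are given larger penalties.
    return [CacheLevel("L1", 2), CacheLevel("L2", 12), CacheLevel("L3", 80)]
```

In this model a first access to an address pays the penalty of every level plus memory, while a repeated access to the same address is served by L1 at the smallest penalty.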
Within a pipelined superscalar processor, instructions are first fetched, decoded and then buffered. Instructions can be dispatched to execution units as resources and operands become available. Additionally, instructions can be fetched speculatively based on predictions about branches taken. The result is a pool of instructions in varying stages of execution, none of which have completed by writing final results to the system memory hierarchy. As resources become available and branches are resolved, the instructions are retired in program order, thus preserving the appearance of a machine that executes the instructions in program order. Overall instruction throughput can be further improved by modifying the hardware within the processor, for example, by having multiple execution units in a single processor core. In a superscalar architecture, instructions may be completed in-order or out-of-order. In-order completion means no instruction can complete before all instructions dispatched ahead of it have been completed. Out-of-order completion means that an instruction is allowed to complete before all instructions ahead of it have been completed, as long as predefined rules are satisfied. Microprocessors may provide varying levels of out-of-order execution support, meaning that the ability to identify and execute instructions out-of-order may be limited.
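The retirement rule described above (instructions may finish execution in any order but are retired in program order) can be illustrated with a small sketch. The function name `retire_in_order` is hypothetical, and the model assumes an unlimited number of instructions may retire in the same cycle.

```python
def retire_in_order(finish_cycle):
    """Map execution-finish cycles to retirement cycles.

    finish_cycle[i] is the cycle in which instruction i (in program order)
    finishes executing. An instruction retires no earlier than its own
    finish and no earlier than the retirement of every older instruction.
    """
    retire = []
    latest = 0
    for done in finish_cycle:
        latest = max(latest, done)
        retire.append(latest)
    return retire


# Instructions 1 and 2 finish before instruction 0 but must wait for it.
print(retire_in_order([3, 1, 2, 7, 4]))  # [3, 3, 3, 7, 7]
```

The retirement cycles never decrease, which is what preserves the appearance of in-order execution even though the second and third instructions finished first.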
One major motivation for limiting out-of-order execution support is the enormous complexity that is required to identify which instructions can execute early, and to track and store the out-of-order results. Additional complexities arise when the instructions executed out-of-order are determined to be incorrect per the in-order execution model, requiring their execution to not impact the architected state of the processor when an older instruction causes an exception. As processor speeds continue to increase, it becomes more attractive to eliminate some of the complexities associated with out-of-order execution. This change will eliminate logic (and its corresponding chip area, or “real estate”) from the chip which is normally used to track out-of-order instructions, thereby allowing additional real estate to become available for use by other processing functions.
A typical instruction stream is non-linear, since there are many branches in the code. A branch instruction selects one of two paths depending upon certain previously computed results. Since the next instruction address cannot be fully resolved until the branch is actually executed, there would usually be a long stall between the branch and the next instruction. As mentioned above, modern processors implement some form of prediction to speculatively fetch the instructions after the branch, in most cases eliminating this stall altogether. However, no such mechanism is perfect, so there will inevitably be cases where the incorrect instructions were fetched into the machine and some form of time-consuming corrective action is required to ensure that only the right instructions are executed. During this time, no forward progress can be made.
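The text does not name a particular prediction scheme. Purely as an example of one widely known scheme, and as an assumption of this sketch rather than anything stated above, a two-bit saturating counter per branch address might be modeled as follows. The names `TwoBitPredictor` and `count_mispredictions` are hypothetical.

```python
class TwoBitPredictor:
    """Two-bit saturating counter per branch address (illustrative only).

    States 0 and 1 predict not-taken; states 2 and 3 predict taken.
    Unseen branches start in state 1 (weakly not-taken), an assumed default.
    """

    def __init__(self):
        self.counters = {}

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        state = self.counters.get(pc, 1)
        self.counters[pc] = min(3, state + 1) if taken else max(0, state - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run the predictor over a sequence of actual outcomes for one branch."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch taken nine times and then not taken on loop exit.
loop = [True] * 9 + [False]
p = TwoBitPredictor()
print(count_mispredictions(p, 0x100, loop))  # 2 on the first pass
print(count_mispredictions(p, 0x100, loop))  # 1 on the second pass
```

Even in this favorable case the loop exit is always mispredicted, which illustrates why some corrective action remains unavoidable.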
This hindrance is further exacerbated by other delays that can occur during instruction dispatch. For example, the system might enter a stall condition for reasons other than branch misprediction, such as for a load cache “miss” which occurs when data required by an instruction is not available in a level one (L1) cache and the microprocessor is forced to wait until the data can be retrieved from a slower cache or main memory. Obtaining data from main memory is a relatively slow operation, and when out-of-order execution is limited due to the aforementioned complexities, subsequent instructions cannot be fully executed until valid data is received from memory.
More particularly, an older instruction that takes a long time to execute can create a stall that may prevent any younger or subsequent instructions, including branch instructions, from executing until the time-consuming instruction completes. Without facilities to support all out-of-order execution scenarios, it is not normally possible to change instruction ordering such that forward progress through the instruction stream can be made while the missed data is retrieved.
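The effect of a long-latency older instruction on a younger branch can be illustrated with a deliberately simplified timeline. The sketch below assumes a machine in which each instruction starts only when the previous one has finished (no overlap at all), and every latency value in `LATENCY` is hypothetical; the names `LATENCY` and `in_order_timeline` are likewise introduced only for this example.

```python
# Hypothetical latencies in cycles; none of these figures come from the text.
LATENCY = {"alu": 1, "branch": 1, "load_hit": 2, "load_miss": 200}


def in_order_timeline(program):
    """Return (op, start_cycle, end_cycle) for each instruction.

    Strictly in-order model: an instruction starts when its predecessor ends,
    so one slow instruction delays every younger instruction behind it.
    """
    timeline = []
    cycle = 0
    for op in program:
        end = cycle + LATENCY[op]
        timeline.append((op, cycle, end))
        cycle = end
    return timeline


hit = in_order_timeline(["alu", "load_hit", "alu", "branch"])
miss = in_order_timeline(["alu", "load_miss", "alu", "branch"])
print(hit[-1])   # ('branch', 4, 5)
print(miss[-1])  # ('branch', 202, 203)
```

Under these assumed latencies the branch is not resolved until cycle 203 when the load misses, compared with cycle 5 when it hits, so any misprediction on that branch is discovered only after the stall has ended.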
In light of the foregoing, it would be desirable to devise an improved method of handling incorrect branch predictions. It would be further advantageous if the method could reduce delays associated with branch misprediction in microprocessors with reduced or limited support for out-of-order execution by identifying and executing branches in the instruction stream during a stall condition without changing the architected state of the machine.