1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to an improved method of dispatching program instructions in a processor and to an improved processor design.
2. Description of the Related Art
High-performance computer systems use multiple processors to carry out the various program instructions embodied in computer programs such as software applications and operating systems. A typical multi-processor system 10 is shown in FIG. 1. Computer system 10 has one or more processing units arranged in one or more processor groups; in the depicted system, there are four processing units 12a, 12b, 12c and 12d in processor group 14. The processing units communicate with other components of system 10 via a system or fabric bus 16. Fabric bus 16 is connected to a system memory 20, and various peripheral devices 22. Service processors 18a, 18b are connected to processing units 12 via a JTAG interface or other external service port. A processor bridge 24 can optionally be used to interconnect additional processor groups. System 10 may also include firmware (not shown) which stores the system's basic input/output logic, and seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).
System memory 20 (random access memory or RAM) stores program instructions and operand data used by the processing units, in a volatile (temporary) state. Peripherals 22 may be connected to fabric bus 16 via, e.g., a peripheral component interconnect (PCI) local bus using a PCI host bridge. A PCI bridge provides a low latency path through which processing units 12a, 12b, 12c and 12d may access PCI devices mapped anywhere within bus memory or I/O address spaces. PCI host bridge 22 also provides a high bandwidth path to allow the PCI devices to access RAM 20. Such PCI devices may include a network adapter, a small computer system interface (SCSI) adapter providing interconnection to a permanent storage device (i.e., a hard disk), and an expansion bus bridge such as an industry standard architecture (ISA) expansion bus for connection to input/output (I/O) devices including a keyboard, a graphics adapter connected to a display device, and a graphical pointing device (mouse) for use with the display device.
In a symmetric multi-processor (SMP) computer, all of the processing units 12a, 12b, 12c and 12d are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. As shown with processing unit 12a, each processing unit may include one or more processor cores 26a, 26b which carry out program instructions in order to operate the computer. An exemplary processor core includes the Power5™ processor marketed by International Business Machines Corp., which comprises a single integrated circuit superscalar microprocessor having various execution units (fixed-point units, floating-point units, and load/store units), registers, buffers, memories, and other functional units, which are all formed by integrated circuitry. The processor cores may operate according to reduced instruction set computing (RISC) techniques, and may employ both pipelining and out-of-order execution of instructions to further improve the performance of the superscalar architecture.
Each processor core 26a, 26b may include an on-board (L1) cache (typically separate instruction cache and data caches) implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory 20. A processing unit can include another cache, i.e., a second level (L2) cache 28 which, along with a memory controller 30, supports both of the L1 caches that are respectively part of cores 26a and 26b. Additional cache levels may be provided, such as an L3 cache 32 which is accessible via fabric bus 16. Each cache level, from highest (L1) to lowest (L3) can successively store more information, but at a longer access penalty. For example, the on-board L1 caches in the processor cores might have a storage capacity of 128 kilobytes of memory, L2 cache 28 might have a storage capacity of 512 kilobytes, and L3 cache 32 might have a storage capacity of 2 megabytes. To facilitate repair/replacement of defective processing unit components, each processing unit 12a, 12b, 12c, 12d may be constructed in the form of a replaceable circuit board or similar field replaceable unit (FRU), which can be easily swapped installed in or swapped out of system 10 in a modular fashion.
In a superscalar architecture, instructions may be completed in-order and out-of-order. In-order completion means no instruction can complete before all instructions dispatched ahead of it have been completed. Out-of-order completion means that an instruction is allowed to complete before all instructions ahead of it have been completed, as long as a predefined rules are satisfied. Within a pipeline superscalar processor, instructions are first fetched, decoded and then buffered. Instructions can be dispatched to execution units as resources and operands become available. Additionally, instructions can be fetched and dispatched speculatively based on predictions about branches taken. The result is a pool of instructions in varying stages of execution, none of which have completed by writing final results to the system memory hierarchy. As resources become available and branches are resolved, the instructions are retired in program order, thus preserving the appearance of a machine that executes the instructions in program order. Overall instruction throughput can be further improved by modifying the hardware within the processor, for example, by having multiple execution units in a single processor core.
Another technique known as hardware multithreading can be used to independently execute smaller sequences of instructions called threads or contexts. When a processor, for any of a number of reasons, stalls and cannot continue processing or executing one of these threads, the processor can switch to another thread. The term “multithreading” as used by those skilled in the art of computer processor architecture is not the same as the software use of the term in which one task is subdivided into multiple related threads. Software multithreading substantially involves the operating system which manipulates and saves data from registers to main memory and maintains the program order of related and dependent instructions before a thread switch can occur. Software multithreading does not require nor is it concerned with hardware multithreading and vice versa. Hardware multithreading manipulates hardware-architected registers, execution units and pipelined processors to maintain the state of one or more independently executing sets of instructions (threads) in the processor hardware. Hardware threads could be derived from, for example, different tasks in a multitasking system, different threads compiled from a software multithreading system, or from different I/O processors. In each example of hardware multithreading, more than one thread can be independently maintained in a processor's registers.
Simultaneous multithreading (SMT) is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle. Unlike other hardware multithreaded architectures in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to simultaneously compete for and share processor resources. Also, unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism (ILP), simultaneous multithreading uses multiple threads to compensate for low single-thread ILP. The performance consequence is significantly higher instruction throughput and program speedups on a variety of workloads that include commercial databases, web servers and scientific applications in both multi-programmed and parallel environments.
There are still some performance disadvantages with SMT processing. In a typical SMT processor, two threads cannot be dispatched in the same cycle due to timing and complexity. In other words, one thread, and only that thread, can be dispatched in a given cycle, so another thread vying for resources must wait for its turn to be dispatched. If the dispatching thread cannot use up all resources (e.g., execution units), then one or more execution units may sit idle because the dispatching thread does not have enough instructions to feed all execution units. For example, if there were two fixed-point units (FXUs) and two load/store units (LSUs) in the processor, and if the dispatching thread only had two fixed-point instructions to be dispatched, then the two LSUs would sit idle for one cycle while the two FXUs are executing the instructions. This inefficiency can create bottlenecks in the processor, and lower overall processing throughput of the system.
It would, therefore, be desirable to devise an improved method of handling instructions in an SMT processor so as to increase the effective dispatching bandwidth. It would be further advantageous if the method could more efficiently utilize existing processor hardware without adding undue overhead.