The present invention relates to a multithread data processing system and method which completes processing an instruction and associated fetch request for a first thread while executing instructions of a second thread to increase processing efficiency upon a thread-switch back to the first thread.
Today the most common architecture for high performance, single-chip microprocessors is the reduced instruction set computer, or RISC, architecture. As semiconductor technology has advanced, the goal of RISC architecture has been to develop processor designs that can come close to initiating one instruction on each clock cycle of the machine. This measure, clock cycles per instruction or CPI, is commonly used to characterize architectures for high performance processors. The architectural features of instruction pipelining and cache memories have made the CPI improvements possible. Pipelined instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished execution. Cache memories allow instruction execution to continue, in most cases, without waiting the full access time of a main memory.
Instruction pipelining involves processing an instruction in several stages. Each stage typically takes one clock cycle. While the number of stages can differ depending upon the RISC architecture, typical pipelining includes an instruction fetch stage wherein the instruction is obtained. Superscalar architectures, discussed in more detail below, also include a subsequent dispatch/general purpose register access stage wherein the instruction is decoded and dispatched to the correct pipeline. In discussing both pipeline and superscalar architectures in the present disclosure, these two preliminary stages will be ignored to simplify the description; however, it is to be understood that these two stages are performed in such architectures.
After the preliminary stages of instruction fetch and dispatch/general purpose register access, the next stage, illustrated as stage 1 in FIG. 1, is the address stage. During the address stage, the processor fetches operands from registers for register-to-register scalar operations, e.g., arithmetic and logical operations, or generates virtual addresses for performing a load/store instruction. In the data or second stage, stage 2, register-to-register scalar operations are completed, or the data cache, D-cache, is accessed for load/store instructions using the generated virtual address. The third stage, stage 3, is a commit stage wherein the result of the scalar operation or the data obtained from a load instruction is stored in the destination register.
FIG. 1 illustrates the life of an instruction "A" in a 3-stage pipeline. To gain a performance advantage from pipelining, multiple instructions are executed simultaneously in different stages of the pipeline.
FIG. 2 shows the execution of instructions A, B, and C in the pipeline of FIG. 1. If executed serially without pipelining, instructions A, B, and C would take nine cycles to execute. With pipelining, however, the instructions take only five cycles to execute.
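By way of illustration only, the overlap shown in FIG. 2 can be sketched in Python as follows. The function names `serial_cycles` and `pipeline_schedule` are hypothetical, and the sketch assumes one cycle per stage and no stalls.

```python
def serial_cycles(num_instructions, num_stages=3):
    """Cycles needed when each instruction finishes before the next one starts."""
    return num_instructions * num_stages


def pipeline_schedule(instructions, num_stages=3):
    """Return one row per cycle; row[s] is the instruction in stage s+1, or None.

    Assumes a new instruction enters stage 1 on every cycle and nothing stalls.
    """
    total = num_stages + len(instructions) - 1 if instructions else 0
    rows = []
    for cycle in range(total):
        row = []
        for stage in range(num_stages):
            idx = cycle - stage  # instruction that entered the pipeline `stage` cycles ago
            row.append(instructions[idx] if 0 <= idx < len(instructions) else None)
        rows.append(row)
    return rows


schedule = pipeline_schedule(["A", "B", "C"])
for number, row in enumerate(schedule, start=1):
    print(number, row)
```

Under these assumptions the schedule for instructions A, B, and C has five rows, one per cycle, against nine cycles for serial execution.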
The performance of a conventional RISC processor can be further increased by adopting a superscalar architecture. In a superscalar architecture, multiple functional or execution units are provided to run multiple pipelines in parallel.
FIG. 3 illustrates the pipelines for an exemplary pipelined, 4-way superscalar architecture. A unique pipeline is provided for load/store operations, arithmetic operations, logical operations, and branching operations. Branch execution is done somewhat independently of the other pipelines, but branches move through the branch pipeline like other instructions to maintain instruction order of a branch with respect to other instructions. Execution of a branch involves branch target address generation, condition code checking in the case of a conditional branch, fetching of instructions at the branch target address, cancelling execution/commitment of all instructions in the pipeline that come after a taken branch in program order, and changing the program counter for a taken branch. These stages are generally performed in stages 1-3 as illustrated in FIG. 3 and the two preliminary stages, instruction fetch and dispatch/general purpose register access, discussed above.
In a superscalar architecture, only one instruction can be dispatched to a pipeline at a time, and dependencies between instructions may inhibit dispatch or stall the pipeline. The example shown in FIG. 3 shows instructions A-I being executed. Execution of these instructions would take a minimum of 27 cycles in a non-pipelined, non-superscalar processor, and a minimum of 11 cycles in a pipelined, non-superscalar processor. In the pipelined, 4-way superscalar processor, however, execution of instructions A-I takes only five cycles.
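The three cycle counts given for instructions A-I can be reproduced with a short Python sketch. The function `min_cycles` is hypothetical. It assumes three one-cycle stages, no dependencies, and an instruction mix that lets every pipeline be filled on each dispatch, which is an idealization of FIG. 3.

```python
def min_cycles(num_instructions, num_stages=3, width=1, pipelined=True):
    """Best-case cycle count for an idealized processor.

    width     -- number of parallel pipelines (1 means non-superscalar)
    pipelined -- whether a new dispatch group may start on every cycle
    """
    if num_instructions == 0:
        return 0
    groups = -(-num_instructions // width)  # ceiling division: dispatch groups needed
    if not pipelined:
        return groups * num_stages  # each group runs all stages before the next starts
    return num_stages + groups - 1  # first group fills the pipe, then one group per cycle


print(min_cycles(9, pipelined=False))  # non-pipelined, non-superscalar
print(min_cycles(9))                   # pipelined, non-superscalar
print(min_cycles(9, width=4))          # pipelined, 4-way superscalar
```

With nine instructions these calls give 27, 11, and 5 cycles respectively. Real dispatch would be constrained further by which pipeline each instruction requires.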
In this pipelined, superscalar architecture, instructions may be completed in-order or out-of-order. In-order completion means no instruction can complete before all instructions dispatched ahead of it have been completed. Out-of-order completion means that an instruction may be completed before all instructions ahead of it have been completed, as long as a predefined set of rules is satisfied. For both in-order and out-of-order execution in pipelined, superscalar systems, there are conditions that will cause pipelines to stall. An instruction that is dependent upon the results of a previously dispatched instruction which has not been completed can cause the pipeline to stall.
For instance, instructions dependent on a load/store instruction, e.g., a fetch request, which experiences a cache miss will stall until the miss is resolved, i.e., until the requested data has been brought into the cache. Keeping a high hit ratio in, for example, the data cache is not trivial, especially for computations involving large data structures. A cache miss can cause the pipelines to stall for several cycles, and the total delay attributable to memory latency will be severe if the miss ratio is high.
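The effect of the miss ratio on memory delay can be estimated with the standard textbook expression for average access time, sketched below in Python. This expression is not taken from the present disclosure; the function name and the numeric values are assumptions chosen only for illustration.

```python
def average_access_cycles(hit_cycles, miss_ratio, miss_penalty_cycles):
    """Average cycles per data access: hit time plus the expected miss penalty."""
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must lie between 0 and 1")
    return hit_cycles + miss_ratio * miss_penalty_cycles


# Hypothetical values: 1-cycle hit, 20-cycle miss penalty.
for ratio in (0.01, 0.05, 0.20):
    print(ratio, average_access_cycles(1, ratio, 20))
```

Even with a small miss ratio, the miss penalty term can dominate the average when main memory is many cycles away.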
Although memory devices used for main memory are becoming faster, the speed gap between such memory chips and high-end processors is growing. Accordingly, a significant amount of execution time in current high-end processor designs is spent waiting for cache misses, and it is expected that memory access delays will make up an increasing proportion of processor execution time unless memory latency tolerance techniques are implemented.
One known technique for tolerating memory latency is hardware multithreading. In general, hardware multithreading employs a processor that maintains the state of several tasks or threads on-chip. This generally involves replicating the processor registers for each thread.
For instance, for a processor implementing the RISC architecture, sold under the trade name PowerPC™ by the assignee of the present application, to perform multithreading, the processor must maintain N states to run N threads. Accordingly, the following are replicated N times: general purpose registers or GPRs, floating point registers or FPRs, condition register or CR, floating point status and control register or FPSCR, count register, link register or LR, exception register or XER, save restore registers 0 and 1 or SRR0 and SRR1, and some of the special purpose registers or SPRs. Additionally, the segment look aside buffer or SLB, can be replicated or, alternatively, each entry can be tagged with the thread number. If not replicated or tagged, the SLB must be flushed on every thread switch. Also, some branch prediction mechanisms should be replicated, e.g., the correlation register and the return stack. Fortunately, there is no need to replicate some of the larger functions of the processor such as: level one instruction cache or L1 I-cache, level one data cache or L1 D-cache, instruction buffer, store queue, instruction dispatcher, functional or execution units, pipelines, translation look aside buffer or TLB, and branch history table. When one thread encounters a delay, the processor rapidly switches to another thread. The execution of this thread overlaps with the memory delay on the first thread.
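The division between replicated and shared state described above can be sketched as a Python data structure. The class and field names are hypothetical, only a subset of the listed registers and shared functions is modeled, and the register counts of 32 are an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """State replicated once per thread (subset of the registers listed above)."""
    gprs: list = field(default_factory=lambda: [0] * 32)    # general purpose registers
    fprs: list = field(default_factory=lambda: [0.0] * 32)  # floating point registers
    cr: int = 0     # condition register
    fpscr: int = 0  # floating point status and control register
    ctr: int = 0    # count register
    lr: int = 0     # link register
    xer: int = 0    # exception register
    srr0: int = 0   # save restore register 0
    srr1: int = 0   # save restore register 1


@dataclass
class SharedResources:
    """Larger functions that are not replicated (modeled here as empty tables)."""
    l1_icache: dict = field(default_factory=dict)
    l1_dcache: dict = field(default_factory=dict)
    tlb: dict = field(default_factory=dict)
    branch_history_table: dict = field(default_factory=dict)


class MultithreadedProcessor:
    def __init__(self, num_threads):
        self.threads = [ThreadState() for _ in range(num_threads)]  # N states for N threads
        self.shared = SharedResources()                             # one copy for all threads
        self.active_id = 0

    @property
    def active(self):
        return self.threads[self.active_id]

    def switch_to(self, thread_id):
        # A switch only selects another register set; no state is copied.
        self.active_id = thread_id


proc = MultithreadedProcessor(num_threads=2)
proc.active.gprs[3] = 7  # thread 0 writes a register
proc.switch_to(1)        # thread 1 sees its own, untouched register set
```

Because each thread keeps its own register set on-chip, a switch in this sketch is a change of index rather than a save and restore, which is what allows the switch to be rapid.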
Two types of multithreading exist: hardware multithreading and software multithreading. There are two basic forms of hardware multithreading. A traditional form is to keep N threads, or states, in the processor and interleave the threads on a cycle-by-cycle basis. This eliminates all pipeline dependencies because instructions in a single thread are separated. The other form of hardware multithreading is to switch the threads on some long-latency event. A preferred embodiment of the present invention employs hardware multithreading and switches between threads on some long-latency event.
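The two forms of hardware multithreading can be contrasted with a minimal Python sketch that lists which thread owns the pipelines on each cycle. Both function names and the round-robin choice of the next thread are assumptions of the sketch, not features taken from the disclosure.

```python
def interleaved_schedule(num_threads, num_cycles):
    """Cycle-by-cycle interleaving: thread i issues on cycles i, i+N, i+2N, ..."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def switch_on_event_schedule(num_threads, num_cycles, event_cycles):
    """Stay on one thread until a long-latency event, then move to the next thread."""
    active = 0
    schedule = []
    for cycle in range(num_cycles):
        schedule.append(active)
        if cycle in event_cycles:  # e.g. a cache miss is detected on this cycle
            active = (active + 1) % num_threads
    return schedule


print(interleaved_schedule(2, 6))
print(switch_on_event_schedule(2, 8, {2, 5}))
```

In the first schedule consecutive cycles always belong to different threads; in the second a thread keeps the pipelines until an event forces a switch.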
Multithreading permits the pipeline(s) to be used for useful work for a separate thread when a pipeline stall condition is detected for the current thread. Multithreading is described with respect to FIGS. 4 and 5, with FIG. 4 showing what happens when there is no multithreading. FIG. 4 illustrates the processing performed by a pipelined, 4-way superscalar architecture when a cache miss occurs for an instruction in the storage pipeline. Assume that instructions dispatched after instruction D0 have a data dependency upon instruction A0. Instruction A0 is a storage instruction which has a cache miss that takes five cycles to resolve. Accordingly, without multithreading, the pipelines stall until the cache miss is resolved for A0. Consequently, when the cache miss for instruction A0 occurs in cycle 3, the processor stalls for cycles 4-7 until the data for instruction A0 returns in cycle 7 and is committed in cycle 8. Processing of instructions then continues as shown in cycles 9 and 10.
By contrast, hardware multithreading permits the processor to remain active even when a cache miss is encountered in a first thread. As shown in FIG. 5, instruction A0 has a cache miss in cycle 3, just as in FIG. 4. FIG. 5, however, represents multithread processing; consequently, in cycle 4 the instructions of thread 0 are squashed from the pipelines, and the instructions for thread 1 are dispatched to the pipelines. The processor processes thread 1 during cycles 4, 5, 6 and 7. Note that in a non-multithread architecture, the processor would merely stall during these cycles. Switching threads can take a cycle or more, but for ease of illustration this switching time has not been accounted for in the figures of the present invention.
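The difference between FIG. 4 and FIG. 5 over cycles 1-7 can be summarized with a small Python sketch. The function `timeline` is hypothetical and, like the figures, ignores the thread switching time.

```python
def timeline(miss_cycle, last_stall_cycle, multithreaded):
    """Label each cycle from 1 to last_stall_cycle with what the pipelines are doing."""
    labels = []
    for cycle in range(1, last_stall_cycle + 1):
        if cycle <= miss_cycle:
            labels.append("T0")     # thread 0 runs up to and including the miss
        elif multithreaded:
            labels.append("T1")     # thread 1 uses the cycles after the miss
        else:
            labels.append("stall")  # without multithreading the pipelines wait
    return labels


without_mt = timeline(miss_cycle=3, last_stall_cycle=7, multithreaded=False)
with_mt = timeline(miss_cycle=3, last_stall_cycle=7, multithreaded=True)
print(without_mt.count("stall"), with_mt.count("stall"))
```

With the miss in cycle 3, the four cycles 4-7 are stalled in the first case and are used by thread 1 in the second.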
The processor will continue to process thread 1 until a thread switch back to thread 0 occurs. For purposes of discussion, a thread switch back to thread 0 is assumed to occur because instruction M1 experiences a cache miss. As one skilled in the art knows, however, several thread switching techniques exist. For example, one possible thread switching method can be reviewed in "Sparcle: An Evolutionary Design for Large-Scale Multiprocessors," by Agarwal et al., IEEE Micro, Vol. 13, No. 3, pp. 48-60, June 1993.
Because instruction M1 experiences a cache miss, the thread 1 instructions are squashed from the pipelines and instructions A0, B0, C0 and D0 for thread 0 are dispatched in cycle 10. As discussed above, the instructions following instruction D0 are dependent upon the completion of instruction A0. Processing of instructions for thread 0 then continues as shown in cycle 11. In subsequent cycles, instruction A0 experiences a cache hit because the data for instruction A0 was loaded into the data cache. Accordingly, the processor continues execution of the instructions in thread 0.
Unfortunately, in the multithread architecture of the conventional data processing system, an instruction of a first thread receiving a cache miss must wait for the first thread to become the active or foreground thread before being processed through the pipeline. For example, the Agarwal et al. article cited above discloses such a system. Consequently, this architecture requires a completing cycle before processing instructions dependent upon the instruction receiving a cache miss.