1. Field of the Invention
The present invention relates to a method and apparatus for decreasing thread switch latency in a multithread processor.
2. Description of Related Art
Today the most common architecture for high-performance, single-chip microprocessors is the RISC (reduced instruction set computer) architecture. As semiconductor technology has advanced, the goal of RISC designs has been to develop processors that come close to initiating one instruction on each clock cycle of the machine. This measure, clock cycles per instruction (CPI), is commonly used to characterize architectures for high-performance processors. The architectural features of instruction pipelining and cache memories have made these CPI improvements possible. Pipelined instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished execution. Cache memories allow instruction execution to continue, in most cases, without waiting for the full access time of a main memory.
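The relationship between cache behavior and CPI described above can be made concrete with a short sketch. This is an illustrative calculation, not part of the specification; the miss rate and miss penalty used are assumed example values.

```python
# Illustrative sketch: how cache misses inflate the effective CPI of a
# pipelined processor that could otherwise sustain ~1 instruction per cycle.

def effective_cpi(base_cpi, miss_rate, miss_penalty_cycles):
    """Effective CPI = base CPI plus average stall cycles per instruction."""
    return base_cpi + miss_rate * miss_penalty_cycles

# Assumed figures: 2% of instructions miss the cache, each miss stalls
# the pipeline for 50 cycles of main-memory access time.
print(effective_cpi(1.0, 0.02, 50))  # 2.0
```

Even a small miss rate doubles the effective CPI here, which is why memory latency tolerance techniques become important as the processor-memory speed gap widens.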
Although memory devices used for main memory are becoming faster, the speed gap between such memory chips and high-end processors continues to widen. Accordingly, a significant amount of execution time in current high-end processor designs is spent waiting for cache misses, and it is expected that memory access delays will make up an increasing proportion of processor execution time unless memory latency tolerance techniques are implemented.
One known technique for tolerating memory latency is multithreading. Two types of multithreading exist: hardware multithreading and software multithreading. In general, hardware multithreading employs a processor that maintains the state of several tasks or threads on-chip. This generally involves replicating the processor registers for each thread.
There are two basic forms of hardware multithreading. A traditional form is to keep N threads, or states, in the processor and interleave the threads on a cycle-by-cycle basis. This eliminates all pipeline dependencies because instructions from the same thread are separated in the pipeline. The other form of hardware multithreading is to switch the threads on some long-latency event. A preferred embodiment of the present invention employs hardware multithreading and switches between threads on some long-latency event.
Thread switching itself, however, requires a number of clock cycles. A significant portion of the latency caused by a thread switch is attributable to the instruction fetch required to begin executing a new thread. FIG. 1 illustrates a prior art structure for dispatching instructions for processing by the plurality of pipelines of a multithread processor. As shown in FIG. 1, an instruction pass multiplexer 8 has a first input and a second input. The first input is connected to an instruction cache 4 which stores a plurality of instructions. The instructions stored by the instruction cache 4 belong to both the active thread, the thread currently being executed, and one or more dormant threads, threads not currently being executed. The second input of the instruction pass multiplexer 8 is connected to the main memory and/or memory sub-system 2. Main memory stores all the instructions for each thread, while a memory sub-system such as a level two cache may store less than all the instructions for the various threads, but more instructions than the instruction cache 4. As shown in FIG. 1, the instruction cache 4 is also connected to the main memory and/or memory sub-system 2.
In accordance with a line fill bypass signal received from the control logic 6, the instruction pass multiplexer 8 outputs instructions addressed from either the instruction cache 4 or the main memory and/or memory sub-system 2 by the control logic 6. The instruction pass multiplexer 8 outputs the instructions to a primary instruction queue 10. The primary instruction queue 10 stores a plurality of instructions which it outputs for dispatch to the plurality of processing pipelines implemented by the multithread processor.
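The select behavior of the instruction pass multiplexer 8 can be modeled in a few lines. The function and signal names below are illustrative, not from the specification; the sketch only shows that the line fill bypass signal chooses between the two instruction sources.

```python
# Minimal model of the instruction pass multiplexer of FIG. 1 (names are
# illustrative assumptions). When the line fill bypass signal is asserted,
# instructions pass from main memory / the memory sub-system; otherwise
# they pass from the instruction cache.

def instruction_pass_mux(line_fill_bypass, from_main_memory, from_icache):
    """Select the main-memory input when bypass is asserted,
    otherwise the instruction cache input."""
    return from_main_memory if line_fill_bypass else from_icache

print(instruction_pass_mux(False, "mem_instr", "cache_instr"))  # cache_instr
print(instruction_pass_mux(True,  "mem_instr", "cache_instr"))  # mem_instr
```

The selected instructions are then written into the primary instruction queue 10 for dispatch to the processing pipelines.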
When a thread switch occurs, the instructions in the primary instruction queue 10 are invalidated, and during a first clock cycle, addresses for instructions of the dormant thread which is becoming the active thread are generated by the control logic 6. During the second clock cycle, instructions are addressed based on the newly generated instruction addresses, and the control logic 6 outputs a line fill bypass signal such that the instruction pass multiplexer 8 selects the input connected to the instruction cache 4. If the instructions being addressed are resident in the instruction cache 4, then during the third clock cycle, the instruction pass multiplexer 8 outputs these instructions to the primary instruction queue 10, and the primary instruction queue 10 outputs the instructions for dispatch to the processing pipelines. Accordingly, when the instructions of the dormant thread which is becoming the active thread are resident in the instruction cache 4, a thread switch takes three clock cycles to complete.
Of course, if the instruction cache 4 does not contain the instructions of the dormant thread which is becoming the active thread addressed by the control logic 6, then the time to complete the thread switch will increase by the amount of time required to resolve the instruction cache miss.
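The cycle accounting of the prior-art thread switch described above can be summarized in a short sketch. This is a hedged model of the sequence in the text; the miss penalty is an assumed parameter, since the specification states only that a miss increases the switch time by the time required to resolve it.

```python
# Sketch of the prior-art thread-switch timing: three clock cycles when the
# new thread's instructions are resident in the instruction cache, plus a
# miss-resolution penalty (an assumed parameter) when they are not.

def thread_switch_cycles(icache_hit, miss_penalty_cycles=0):
    cycles = 1   # cycle 1: control logic generates addresses for the new thread
    cycles += 1  # cycle 2: instruction cache is addressed; bypass mux selects it
    if not icache_hit:
        cycles += miss_penalty_cycles  # resolve the instruction cache miss
    cycles += 1  # cycle 3: instructions forwarded to the primary queue, dispatched
    return cycles

print(thread_switch_cycles(icache_hit=True))                           # 3
print(thread_switch_cycles(icache_hit=False, miss_penalty_cycles=20))  # 23
```

The fixed three-cycle cost on a cache hit is the thread switch latency that the present invention seeks to decrease.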