The performance of microprocessors has been significantly improved by reducing the instruction execution time and by concurrently executing more than one instruction per cycle, as many RISC superscalar microprocessors are currently designed to do. Today's commercial microprocessors have already achieved a good fraction of the performance level available through supercomputer systems. Furthermore, the significant increase in the clock rate and gate counts expected in the near future for single-chip technology makes microprocessor technology a unique tool for obtaining additional supercomputation power in a cost-effective manner.
However, the single-threaded pipelined instruction issuing architecture used in the current superscalar microprocessors, such as the i860 and the MC88100, will no longer significantly increase the computational power. The amount of parallelism which exists in a single instruction thread is limited by the data and control dependencies among the instructions. The dependencies slow down the instruction issuing rate and lead to a poor utilization of the function units in the processor. While one function unit is busy, others may be idle waiting for the results from the busy unit.
One way of improving the utilization of the function units is to interleave a plurality of different instruction threads. In this approach, a plurality of instruction threads are executed concurrently.
An instruction thread may be defined as a set of instructions belonging to a particular context. Illustratively, an instruction thread is independent of other instruction threads. Threads can be generated from a single program that exhibits sufficient parallelism or from different programs. Data and control dependencies between instructions in a single thread prevent the simultaneous issuing of instructions to all function units. However, instructions from different threads are independent of each other and can be issued to a plurality of function units concurrently.
In a multi-threaded architecture, multiple contexts are supported by the hardware so that multiple instruction threads can be executed simultaneously without context-switch overhead. Because no context-switch overhead exists among threads which are executed at the same time, both intra-thread and inter-thread instruction level parallelism can be exploited to improve execution rate and processor throughput.
Many multi-threaded architectures have been proposed to achieve higher performance and improved resource utilization in a single-chip microprocessor. In R. G. Prasadh and C. L. Wu, "A Benchmark Evaluation of a Multi-threaded RISC Processor Architecture," Proc. of the International Conference on Parallel Processing, 1991, a superscalar architecture based on a VLIW model is proposed to explore the performance of a multi-threaded architecture. A dynamic interleaving technique is proposed to solve the resource contention problem. In G. E. Daddis, Jr. and H. C. Tong, "The Concurrent Execution of Multiple Instruction Streams on Superscalar Processors," Proc. of the International Conference on Parallel Processing, 1991, a system is disclosed wherein there is concurrent processing of two threads on a superscalar processor and wherein an instruction dispatch stack is used to schedule instructions at runtime. A dynamic register allocation technique is utilized to exploit both intra-thread and inter-thread instruction level parallelism.
In these prior art systems, dynamic interleaving and scheduling techniques are used to solve the contention of resources among threads.
The dynamic interleaving technique is now discussed in greater detail. FIG. 1 schematically illustrates a microprocessor 10 which concurrently executes multiple instruction threads using dynamic interleaving.
The processor 10 comprises a plurality of function units 14. The function units are labeled FU-1, FU-2, . . . , FU-N. Illustratively, there are eight such function units and these include a load/store unit that performs memory reads and writes, an integer unit that performs data move and integer add and subtract operations, a logic unit responsible for bit-field operations, an integer/floating point conversion unit to do data type conversions, a floating point adder unit, a floating point multiplier unit, a floating point divide unit and a branch unit. All units are pipelined and are capable of accepting a new instruction in every cycle. Illustratively, in FIG. 1, FU-1 is a load/store unit which accesses a data cache (not shown). The function units 14 are connected via an interconnection network 16 to the register file 18. Each instruction thread executed by the processor 10 has a private register bank in the register file 18. The processor 10 executes T instruction threads labeled 1,2, . . . , T. Thus, the register file 18 comprises T register banks, one for each instruction thread.
Illustratively, each register bank comprises thirty-two 32-bit integer registers and sixteen 64-bit floating point registers. The integer, logic, load/store and branch units can access only the integer registers, while the floating point units are restricted to using the floating point registers. Only the integer/floating-point conversion unit can access registers of either type. All data transfers between the integer and floating-point registers are done by the integer/floating-point conversion unit.
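The register access rules above can be summarized in a small lookup table. The following Python fragment is only an illustrative sketch: the unit names, the `can_access` helper and the `make_register_file` helper are hypothetical labels chosen for this example and do not come from the cited architecture.

```python
# Illustrative model of the register access rules described above.
# All identifiers here are hypothetical names chosen for the sketch.
INT, FP = "int", "fp"

# Thirty-two integer registers and sixteen floating point registers per bank.
REGS_PER_BANK = {INT: 32, FP: 16}

# Register classes each kind of function unit may touch.
ACCESS = {
    "load_store": {INT},
    "integer": {INT},
    "logic": {INT},
    "branch": {INT},
    "fp_add": {FP},
    "fp_mul": {FP},
    "fp_div": {FP},
    "convert": {INT, FP},  # the only unit that may access both classes
}


def can_access(unit, reg_class, reg_index):
    """Return True if `unit` may access register `reg_index` of `reg_class`."""
    return reg_class in ACCESS[unit] and 0 <= reg_index < REGS_PER_BANK[reg_class]


def make_register_file(num_threads):
    """Build one private register bank per instruction thread."""
    return [
        {INT: [0] * REGS_PER_BANK[INT], FP: [0.0] * REGS_PER_BANK[FP]}
        for _ in range(num_threads)
    ]
```

Under this model, any value that moves between the two register classes has to pass through the `convert` unit, since no other entry in the table lists both classes.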
The instruction set utilized by the processor 10 of FIG. 1 is a subset of the RISC instruction set developed for the Distributed Instruction Set Computer. (See, e.g., L. Wang and C. L. Wu, "Distributed Instruction Set Computer Architecture", IEEE Trans. on Computer, 1991; and L. Wang, "Distributed Instruction Set Computer", Ph. D. Dissertation, Univ. Texas, Austin, December 1989; the contents of which are incorporated herein by reference). The instruction set consists of forty-nine machine instructions defined orthogonally in three formats: 3-operand, 2-operand and 1-operand. Illustratively, machine instructions are thirty-two bits in length.
The processor 10 of FIG. 1 executes a compiler. The compiler comprises two parts. Its front end transforms a high level program written, for example, in the C language, into a sequence of machine instructions taken from the instruction set described above. The back end of the compiler converts the sequential code into horizontal instruction words (HIWs). A horizontal instruction word comprises a plurality of sections or fields, wherein each field corresponds to a particular function unit and is capable of containing a machine instruction to be executed by that particular function unit. Each instruction thread to be executed by the processor 10 of FIG. 1 is made up of these horizontal instruction words. The compiler generates the horizontal instruction words by combining machine instructions that do not have data dependencies between them. Thus, a horizontal instruction word comprises data-independent instructions that can be issued in the same clock cycle. If there is no instruction for a particular function unit in a horizontal instruction word, the compiler inserts a NOOP (no operation) instruction into the appropriate section of the horizontal instruction word.
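The packing step performed by the compiler back end may be pictured with the following Python sketch. It is not the compiler's actual algorithm, which the text does not specify; it assumes a simple in-order greedy pass that starts a new horizontal instruction word whenever the next instruction needs an occupied function unit or has a data dependency on an instruction already in the word. The `Instr` record, the `depends` test and the register names are hypothetical.

```python
from typing import NamedTuple

NOOP = "NOOP"


class Instr(NamedTuple):
    op: str       # mnemonic
    unit: int     # index of the function unit that executes the instruction
    dest: str     # register written
    srcs: tuple   # registers read


def depends(later, earlier):
    """Conservative data-dependency test between two instructions."""
    return (
        earlier.dest in later.srcs       # read-after-write
        or earlier.dest == later.dest    # write-after-write
        or later.dest in earlier.srcs    # write-after-read
    )


def pack(instrs, num_units):
    """Greedily pack sequential code into fixed-width horizontal instruction words."""
    groups, current = [], []
    for ins in instrs:
        conflict = any(p.unit == ins.unit or depends(ins, p) for p in current)
        if conflict:
            groups.append(current)
            current = []
        current.append(ins)
    if current:
        groups.append(current)

    words = []
    for group in groups:
        word = [NOOP] * num_units   # unused fields are filled with NOOPs
        for ins in group:
            word[ins.unit] = ins.op
        words.append(word)
    return words


# Hypothetical three-instruction sequence for a four-unit machine.
seq = [
    Instr("ADD2", 0, "r1", ("r2",)),
    Instr("SHLL2", 1, "r3", ("r4",)),
    Instr("ADD2", 0, "r5", ("r1",)),
]
hiws = pack(seq, 4)
```

Here the first two instructions share a word because they use different units and different registers, while the third starts a new word because the integer unit field is already taken.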
In the processor 10 of FIG. 1, the horizontal instruction words of each instruction thread are stored in an individual instruction cache 20. Each instruction cache 20 contains the horizontal instruction words of one instruction thread as compiled by the compiler in the manner described above.
The processor 10 of FIG. 1 also includes the instruction fetch units 24 which fetch instructions from the corresponding instruction caches under the control of the dynamic interleaving unit 26.
Each instruction passes through three main pipeline stages:
1. an instruction fetch stage, wherein an instruction is fetched from its cache,
2. an instruction interleave/decode stage, wherein an instruction is interleaved, if possible, with other instructions from other threads by the dynamic interleaving unit 26, decoded by a decoding unit 27, and issued to the appropriate function unit 14,
3. an execution stage, wherein the instruction is executed by a function unit and the results are written back into the register file 18.
The compiler avoids contentions between function units through judicious code scheduling.
In the dynamic interleaving process, the dynamic interleaving unit selects a thread according to a priority scheme such as a round-robin scheme. The dynamic interleaving unit examines the next horizontal instruction word of that thread and replaces, if possible, every one of the NOOP instructions (introduced by the compiler) with a corresponding non-NOOP instruction from another thread. The individual instructions of the newly assembled horizontal instruction word are now sent to the appropriate decoders 27. The instruction decoder at each of the function units identifies the thread to which its instruction belongs and generates the appropriate control signals.
FIGS. 2A, 2B, 2C and 2D show an example illustrating dynamic interleaving. As shown in FIG. 2A, in this example, there are four function units: FU-1 which is an Integer Add/Sub Unit, FU-2 which is a logic unit, FU-3, which is a floating point/integer conversion unit, and FU-4 which is a floating point add/sub unit. There are also three instruction threads labeled 1, 2, and 3 which are stored in corresponding instruction buffers or caches. As shown in FIG. 2A, each instruction buffer contains two horizontal instruction words and each horizontal instruction word contains a section corresponding to each function unit.
The instructions shown in the instruction buffers are scheduled by the compiler statically. The example assumes a round-robin strategy in selecting the threads for interleaving and assumes that each thread has its own register set. As shown in FIG. 2B, at CK=1 (i.e. at a first clock cycle), the first horizontal instruction words from the three threads are fetched from the instruction caches and transmitted into the dynamic interleaving unit. Thread 1 is selected first (shown in bold letters in FIG. 2B). The ADD2 instruction of this thread is sent to the integer unit decoder. There are no more instructions in thread 1. Now thread 2 is selected. Because the ADD2 instruction of thread 1 has already been issued to the integer unit, the issue of the ADD2 instruction of thread 2 is deferred until the next clock cycle. However, the logic unit is free. Hence, the SHLL2 instruction of thread 2 is sent to the logic unit decoder. Because there are no more instructions in this thread, the third thread is selected. The FMOVEF instruction of the third thread is issued to the decoder of the floating/integer conversion unit as it is free. No more instructions can be issued now. Thus, at CK=1, an ADD2 instruction from thread 1, a SHLL2 instruction from thread 2, and a FMOVEF instruction from thread 3 are issued simultaneously to the function unit decoders.
At the end of the first clock cycle, threads 1 and 3 have no more instructions in the dynamic interleaving unit. Hence, the next horizontal instruction words from these threads are fetched from their respective instruction caches and transmitted into the dynamic interleaving unit. Thread 2, on the other hand, still has an ADD2 instruction to be issued and, therefore, the next horizontal instruction word from thread 2 is not fetched. The three horizontal instruction words which are in the dynamic interleaving unit at CK=2 are illustrated in FIG. 2C. At CK=2, thread 2 is selected first, following the round-robin strategy. Its ADD2 instruction is sent to the integer unit decoder. Next, thread 3 is selected and the SHLL2 and FSUB instructions from this thread are issued to the logic and floating point Add/Sub units, respectively. Finally, thread 1 is selected; but no instruction from thread 1 can be issued since the required function unit decoders are occupied. The process proceeds in a similar manner in the third clock cycle (CK=3). The instructions stored in the dynamic interleaving unit in the third clock cycle are shown in FIG. 2D, as are the instructions issued to the function units in the third clock cycle. By the end of the third clock cycle, all the instructions from the three threads are issued. In the absence of dynamic interleaving, it would take six clock cycles to issue the instructions in the example, one cycle for each of the six horizontal instruction words. Dynamic interleaving, thus, improves the instruction issue rate by a factor of two in the example.
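The example of FIGS. 2A to 2D can be traced with a short behavioral simulation. The Python sketch below is an assumption-laden model, not the hardware: it rotates the starting priority by one thread per cycle, lets a thread fetch its next horizontal instruction word only after the current word has been completely issued, and issues the instructions of a word independently of one another. Because the figures are not reproduced here, the second words of threads 1 and 2 are hypothetical; they were chosen only so that the trace agrees with the narrative above.

```python
from collections import deque

NOOP = "NOOP"


def interleave(threads, num_units):
    """Simulate dynamic interleaving.

    threads: one list of horizontal instruction words per thread; each word
    is a list of `num_units` mnemonics. Returns one list per clock cycle of
    (thread number, function unit number, mnemonic), both numbered from 1.
    """
    n = len(threads)
    queues = [deque(t) for t in threads]

    def fetch(t):
        # Return the non-NOOP fields of the thread's next word as unit -> op.
        if not queues[t]:
            return {}
        word = queues[t].popleft()
        return {u: op for u, op in enumerate(word) if op != NOOP}

    pending = [fetch(t) for t in range(n)]
    schedule, first = [], 0
    while any(pending) or any(queues):
        busy, issued = set(), []
        for k in range(n):                  # threads in priority order
            t = (first + k) % n
            for u in sorted(pending[t]):
                if u not in busy:           # issue only to a free unit
                    busy.add(u)
                    issued.append((t + 1, u + 1, pending[t].pop(u)))
        schedule.append(issued)
        for t in range(n):                  # fetch only when a word is done
            if not pending[t]:
                pending[t] = fetch(t)
        first = (first + 1) % n             # round-robin priority
    return schedule


# Fields: FU-1 integer add/sub, FU-2 logic, FU-3 conversion, FU-4 FP add/sub.
example = [
    [["ADD2", NOOP, NOOP, NOOP], ["ADD2", NOOP, NOOP, NOOP]],    # 2nd word assumed
    [["ADD2", "SHLL2", NOOP, NOOP], [NOOP, "SHLL2", NOOP, NOOP]],  # 2nd word assumed
    [[NOOP, NOOP, "FMOVEF", NOOP], [NOOP, "SHLL2", NOOP, "FSUB"]],
]
schedule = interleave(example, 4)
for ck, issued in enumerate(schedule, start=1):
    print(f"CK={ck}: {issued}")
```

With these inputs the six horizontal instruction words are issued in three cycles, which corresponds to the factor-of-two improvement stated for the example.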
The basic operation performed in the dynamic interleaving unit is the partial decoding of an instruction to see if it is a NOOP instruction. If an instruction is not a NOOP instruction, then the instruction is issued to the function unit decoder where the necessary control signals are generated. If the instruction is a NOOP, then the next instruction from a lower priority thread has to be checked. The checking is continued (in a domino fashion) until a non-NOOP instruction is encountered or until all the threads have been exhausted. The whole operation has to be performed in one clock cycle. FIG. 3 illustrates a logic circuit to achieve this. In FIG. 3, SW1 and SW2 are logic switches whose functions are as shown. The signal ND is the "NOOP Detected" signal, which is the result of partial instruction decoding, and P is the priority signal. In any clock cycle, only one thread has its priority signal high; the other priority signals are low. The signal IS is the instruction issue signal. When high, it indicates that the instruction from the corresponding thread will be issued to the function unit decoder. The dynamic interleaving unit has a logic circuit, like the one in FIG. 3, for every function unit.
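The domino check for one function unit can be expressed as a small behavioral function. The Python sketch below models only the input/output relationship of the ND, P and IS signals; it does not reproduce the switches SW1 and SW2 of FIG. 3. It assumes that the lower priority threads follow the priority holder in circular order, and the function name `issue_signals` is hypothetical.

```python
def issue_signals(nd, p):
    """Compute the IS signal of every thread for one function unit.

    nd[i] is True when thread i holds a NOOP for this unit (NOOP Detected).
    p[i] is True for the single thread that has priority in this cycle.
    Returns a list in which at most one entry is True.
    """
    if len(nd) != len(p) or sum(p) != 1:
        raise ValueError("exactly one priority signal must be high")
    n = len(nd)
    start = p.index(True)
    issue = [False] * n
    for k in range(n):            # walk the threads in priority order
        i = (start + k) % n
        if not nd[i]:             # first non-NOOP instruction wins the unit
            issue[i] = True
            break
    return issue
```

For example, with three threads, `issue_signals([True, False, False], [True, False, False])` returns `[False, True, False]`: thread 1 has priority but holds a NOOP, so the unit goes to thread 2. In hardware the loop corresponds to a combinational chain, which is why the whole check has to settle within one clock cycle.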
Although dynamic interleaving can achieve higher utilization of function units, some problems remain. First, the hardware has to support the higher instruction fetch bandwidth needed by the NOOP-replacing technique. In addition, implementation of the dynamic interleaving unit requires complex hardware, including special hardware needed to detect the completion of one horizontal instruction word of a thread so that the next horizontal instruction word can be fetched and executed. Because the instructions in the same horizontal instruction word are not all guaranteed to be issued in the same clock cycle, two instructions that have a write-after-read dependency cannot be put in the same horizontal instruction word. Also, two consecutive horizontal instruction words may be issued in non-consecutive cycles, so instructions cannot be put in branch delay slots. These constraints will result in a lower instruction issue rate for a multi-threaded architecture.
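The write-after-read constraint can be illustrated with a toy register model. The Python sketch below assumes that, within one cycle, all register reads take place before any write; the `run` helper, the register names and the add-a-constant instruction form are hypothetical and serve only to show why splitting a horizontal instruction word across cycles can change the result.

```python
def run(regs, cycles):
    """Execute groups of (dest, src, increment) instructions cycle by cycle.

    Within a cycle every source register is read before any destination is
    written, which is the assumed behavior of a horizontal instruction word
    issued as a whole.
    """
    regs = dict(regs)
    for group in cycles:
        results = [(dest, regs[src] + inc) for dest, src, inc in group]  # reads
        for dest, value in results:                                      # writes
            regs[dest] = value
    return regs


start = {"r1": 1, "r2": 0, "r3": 5}
a = ("r2", "r1", 0)    # A reads r1
b = ("r1", "r3", 10)   # B writes r1: write-after-read with respect to A

same_cycle = run(start, [[a, b]])    # whole word issued together
split = run(start, [[b], [a]])       # A deferred by one cycle
```

When both instructions are issued together, A sees the old value of r1 and `same_cycle["r2"]` is 1. When A is deferred because its function unit is occupied, B has already overwritten r1 and `split["r2"]` is 15, which is the hazard that forces the compiler to keep such pairs in separate horizontal instruction words.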
In view of the foregoing, it is an object of the present invention to provide a multi-threaded architecture for a microprocessor which overcomes the problems associated with dynamic interleaving.