Multi-threaded parallel processing technologies have been employed in high-performance processors to reduce the impact of instruction execution latency caused by long pipelines, and have improved instructions-per-cycle performance and efficiency over other processor designs. Multithreading is a well-known technique for both hardware and software acceleration. The Denelcor HEP processor was designed by Burton Smith circa 1979 (see “www-ee.eng.hawaii.edu/˜nava/HEP/introduction.html” for more details). In this design, multiple instructions may be executed from a single thread. One requirement of the design of the Denelcor HEP processor was that each software thread had to complete the current instruction prior to issuing a subsequent instruction. When each hardware thread unit (hereinafter “a context” or “hardware context,” to distinguish it from a software thread) issues an instruction and progresses in sequence, the scheme may be termed barrel multithreading or round-robin scheduling.
In a multithreaded processor, all threads of execution operate simultaneously. In barrel multithreading, each hardware thread unit or context may execute an instruction simultaneously, but only one context may issue an instruction on a given cycle boundary. Therefore, if there are C contexts, C cycles are required to issue an instruction from all contexts. On each clock cycle, the context permitted to issue an instruction is selected by incrementing the context number.
FIG. 1 shows an example of a barrel multithreading execution sequence. In this example, a processor dispatches threads to contexts, one instruction per clock cycle, in the specific order T0->T1->T2->T3->T4->T5->T6->T7->T0 . . . , wrapping around the chain and arriving again at T0. In a processor with 8 contexts, 8 clock cycles are required to dispatch a thread to all 8 contexts. Further, a single context (e.g., T0) may only issue an instruction once every 8 clock cycles.
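The round-robin dispatch order described above can be sketched as follows (a hypothetical Python model for illustration only; the function name and parameters are not from any cited design):

```python
# Hypothetical sketch: barrel (round-robin) issue across C hardware
# contexts, one instruction issued per clock cycle.
def barrel_issue_order(num_contexts, num_cycles):
    """Return which context is permitted to issue on each clock cycle."""
    return [cycle % num_contexts for cycle in range(num_cycles)]

# With 8 contexts, context T0 issues on cycle 0 and cannot issue
# again until cycle 8.
order = barrel_issue_order(8, 10)
```

In this model, the gap of `num_contexts` cycles between consecutive issues from the same context is exactly the single-context issue rate described above.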
FIG. 2 shows an example of execution of two instructions in a non-pipelined processor. To execute the two instructions addi and muli, each instruction progresses through four stages of operation. “addi” represents add immediate, where register r2 is added to the immediate value 8 and the result is stored in register r0. “muli” represents multiply immediate, where register r3 is multiplied by the immediate value 4 and the result is stored in register r8. In the first stage, instruction fetch (IF), the instruction addi is fetched from memory and decoded. In the second stage, register read (RD), the operand r2 is read from a register file. In the third stage, execute (Ex), the operand from register r2 is added to the immediate value 8. In the fourth stage, write back (WB), the result is written into register r0. After the result is written, the next instruction (muli) is fetched. Thus, to execute N instructions, 4N clock ticks are required.
Pipelining is a technique that overlaps the execution of multiple instructions by noting that, when an instruction leaves a particular stage of execution, that stage becomes dormant; if another instruction is available to execute, the latter instruction can begin execution prior to the completion of the previous instruction. FIG. 3 shows an example of pipelined overlapped execution. In a perfectly pipelined machine without hazards, rather than requiring 4N clock cycles to execute N instructions, perfect pipelining reduces the clock cycles to N+3.
However, pipelining may not be without hazards. FIG. 4 shows an example where the result of addi is required by the muli instruction. If the instructions were permitted to execute as shown in FIG. 4, an incorrect result would be returned because the result is not placed into r0 until after the muli instruction reads r0. This is known as a read-after-write (RAW) hazard. To avoid this hazard, the pipeline must be interlocked and thus stall, creating what are known as pipeline bubbles. Pipeline interlocking may introduce non-determinism into the system. If N instructions have no hazards, the execution time is N+3 clock cycles. If every instruction has a hazard, the execution time is 3N+1 cycles. A typical program P will be bounded by N+3<=P<=3N+1.
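The two bounds above can be written out directly (a hypothetical sketch; the function names are illustrative and the worst-case formula is taken verbatim from the text):

```python
PIPELINE_STAGES = 4

def best_case_cycles(n):
    # Perfect overlap: the first instruction fills the 4-stage pipeline,
    # and each remaining instruction completes one cycle later: N + 3.
    return n + (PIPELINE_STAGES - 1)

def worst_case_cycles(n):
    # Every instruction has a RAW hazard with its predecessor, forcing
    # stall cycles between issues, giving the 3N + 1 bound in the text.
    return 3 * n + 1
```

Any real program's cycle count falls between these two values, which is why interlocked pipelines make execution time non-deterministic.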
FIG. 5 shows an example of an execution sequence of a set of instructions in a barrel-threaded pipelined processor. In this example, three independent threads are executing on a pipelined processor that shares execution units. A first thread executes the original addi/muli instruction sequence 502, 504. In the absence of interrupts or long latency loads, there are never any pipeline interlocks because instructions 506, 508 from a second thread and a third thread, respectively, are inserted into the pipeline. Thus, the first thread does not encounter any hazards. In this example, N instructions will always complete in N+3 cycles without any hazards. However, those N instructions must be distributed across a sufficient number of threads to guarantee hazard-free execution. A drawback is that if only a single thread is present in the system, the single thread will always require 3N+1 cycles to execute the program even if the instructions are hazard free.
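The "sufficient number of threads" condition above can be sketched as a simple check (a hypothetical model, not from the source: it assumes a producer's result is written in WB and a dependent instruction reads in RD, so same-thread instructions must be separated by at least pipeline depth minus one cycles):

```python
def round_robin_hides_hazards(num_threads, pipeline_stages=4):
    # Under round-robin issue, consecutive instructions from the same
    # thread are separated by num_threads cycles. A RAW hazard is hidden
    # when that separation covers the write-back latency (stages - 1).
    return num_threads >= pipeline_stages - 1

# Three threads on a 4-stage pipeline (as in FIG. 5) suffice; a single
# thread does not, which is why it falls back to stalled execution.
```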
A number of techniques have been developed in order to improve the performance of single-threaded programs executing on multithreaded processors. One such technique is simultaneous multithreading (SMT) employed in a processor (see “www.cs.washington.edu/research/smt/index.html” for more details). SMT has been employed in Intel's Hyper-Threading as described in “Intel Hyper-Threading Technology, Technical User's Guide”; IBM's POWER5 as described in Clabes, Joachim et al., “Design and Implementation of POWER5 Microprocessor,” Proceedings of the 2004 IEEE International Solid-State Circuits Conference; Sun Microsystems' UltraSPARC T2 as described in “Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors,” Sun BluePrints Online, Sun Microsystems, retrieved 2008-01-09; and the MIPS MT as described in “MIPS32 Architecture,” Imagination Technologies, retrieved 4 Jan. 2014.
Typical SMT-based processors have required each thread to have its own set of registers and additional tracking logic at every stage of a pipeline within the SMT-based processor. This increases the size of the hardware resources, specifically the thread tracking logic needed to implement the design of the SMT-based processor. The thread tracking logic employed by the SMT-based processor is required not only to trace the execution of a thread but also to determine whether the thread has completed execution. Because the SMT-based processor may employ a large number of actively executing hardware contexts, the size of CPU caches and associated translation look-aside buffers (TLB) needs to be large enough to avoid hardware context thrashing.
Although SMT technology may improve single-threaded performance, the above-identified control circuit complexity renders it difficult to apply SMT technology to embedded processors that require low-power consumption.
With simultaneous multithreading, multiple hardware thread units (hardware contexts) may issue multiple instructions each cycle. When combined with superscalar techniques such as out-of-order processing, the additional hardware required for SMT is not significant. However, care must be taken in the thread dispatch to ensure that all threads may issue instructions. To facilitate this, various techniques have been developed, including priority inversion and preemptive scheduling.
An advantage of simultaneous multithreading is that a single thread may issue instructions to the pipeline every clock cycle. Thus, a program P with only a single thread may execute in N+3 cycles on a 4-stage pipeline in the absence of hazards. In practice, SMTs are almost always implemented with superscalar issue logic so that the number of required clock cycles is further reduced to approximately (N+3)/IPC (instructions per cycle). A key consideration of SMT processors is that execution time is no longer deterministic. However, single-threaded performance is significantly enhanced at the expense of additional complex hardware.
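One plausible reading of the superscalar reduction above can be sketched as follows (a hypothetical model: it assumes the issue portion shrinks with IPC while the pipeline-fill cycles do not, which is an interpretation, not a formula from the source):

```python
import math

def smt_cycles(n, stages=4, ipc=1):
    # Superscalar SMT can issue up to `ipc` instructions per cycle,
    # reducing the issue portion from N cycles to ceil(N / ipc);
    # the pipeline still needs (stages - 1) cycles to drain.
    return math.ceil(n / ipc) + (stages - 1)

# With ipc=1 this degenerates to the single-issue N + 3 case.
```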
To overcome SMT control circuit complexity and reduce power consumption, other forms of multi-threading technologies have been developed. Block multi-threading and interleaved multi-threading have been proposed. Unfortunately, block multi-threading technology has been restricted to microcontrollers and other low-performance processors. Conventional interleaved multi-threading technology, also known as token-triggered multi-threading, has simplified control circuitry, but performance suffers when there are fewer software threads than available hardware contexts in the processor. This technology has been promoted in certain high-performance low-power processors. A representative example of token-triggered multi-threading technology is described in U.S. Pat. No. 6,842,848.
Conventional token-triggered multi-threading employs time sharing. Each software thread of execution is granted permission by the processor to execute in accordance with its own assigned clock cycles. Only one software thread per clock cycle is permitted to issue instructions. A token is employed to inform a software thread as to whether the software thread should issue an instruction in the next clock cycle. This further simplifies hardware context logic. No software thread may issue a second instruction until all software threads have issued an instruction; if a software thread has no instruction available to issue, a no-operation (NOP) is issued by the hardware context. Processor hardware ensures that each software thread has the same instruction execution time. The result of an operation may be completed within a guaranteed period of time (e.g., a fixed number of clock cycles). Accordingly, no instruction-execution-related inspection and bypass hardware is needed in the processor design.
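The token-passing and forced-NOP behavior described above can be sketched as follows (a hypothetical Python model for illustration; the queue-based representation of pending instructions is an assumption, not from the cited patent):

```python
from collections import deque

def token_triggered_trace(thread_queues, num_cycles):
    """Simulate token-triggered issue: the token advances one context
    per cycle in fixed order; the context holding the token issues its
    next pending instruction, or a NOP if its thread has none ready."""
    num_contexts = len(thread_queues)
    queues = [deque(q) for q in thread_queues]
    trace = []
    for cycle in range(num_cycles):
        holder = cycle % num_contexts  # token passes in fixed order
        if queues[holder]:
            trace.append(queues[holder].popleft())
        else:
            trace.append("NOP")        # an idle context must issue a NOP
    return trace

# Three contexts; the second has no software thread assigned, so every
# third cycle is wasted on a NOP regardless of other threads' backlogs.
trace = token_triggered_trace([["addi", "muli"], [], ["ld"]], 6)
```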
Conventional token-triggered multi-threading technology simplifies the hardware issue logic of a multi-threaded processor and, accordingly, may achieve high performance with very little power consumption. However, compared with SMT technologies, the performance improvement of a conventional token-triggered multi-threading processor is limited if there are fewer software threads having executable instructions during a clock cycle than available hardware contexts. In such circumstances, hardware contexts that do not have software threads assigned to them must issue NOPs.
In order to avoid interference between software threads and to simplify the hardware structure, conventional token-triggered multithreading employs a time-sharing strategy that can limit the number of instructions executed per cycle. This reduces the processing speed of single-threaded operation. For example, if the instruction for context T1 is not in the cache and requires a reload from external memory, then, due to the slow speed of the external memory, T1 has to wait many cycles to reload instructions. Even if context T0 has an instruction ready, it still must wait rather than issue the instruction at clock cycle C1. Because of the structural limitations of the time-shared data path, clock cycle C1 can only be used by context T1, and in this case the hardware context must issue a NOP.
In the worst case of a single software thread of execution, the performance of a corresponding conventional token-triggered processor is 1/T (where T is the number of hardware contexts). In a 10-threaded token-triggered processor running at 1 GHz, the performance of the processor is effectively reduced to 100 MHz.
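The 1/T degradation above is straightforward to compute (a hypothetical helper; the function name is illustrative):

```python
def single_thread_effective_hz(clock_hz, num_contexts):
    # With token-triggered time sharing, a single software thread may
    # issue only once every num_contexts cycles, so its effective
    # clock rate is the processor clock divided by the context count.
    return clock_hz / num_contexts

# The 10-context, 1 GHz example from the text: 1 GHz / 10 = 100 MHz.
effective = single_thread_effective_hz(1_000_000_000, 10)
```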
To avoid thrashing and to simplify the tracking circuits between hardware context threads, in the Sandblaster 2.0 processor each hardware context has its own separate instruction memory, as described in “The Sandblaster 2.0 Architecture and SB3500 Implementation,” Proceedings of the Software Defined Radio Technical Forum (SDR Forum '08), Washington, D.C., October 2008. Unfortunately, the individual instruction memories cannot be shared between hardware contexts. This may result in underutilized memory resources in addition to reduced performance when the number of software threads is fewer than the number of hardware contexts.