Multi-threaded parallel processing technologies have been employed in high-performance processors to reduce the impact of high-speed processor instruction execution latency caused by long pipelines. Multi-threaded parallel processing technologies have improved instructions-per-cycle performance and efficiency over other processor designs. The most common type of multithreading in general-purpose processors is simultaneous multi-threading (SMT) technology. SMT has been employed in Intel's Hyper-Threading, as described in “Intel Hyper-Threading Technology, Technical User's Guide”; IBM's POWER5, as described in Clabes, Joachim et al., “Design and Implementation of POWER5 Microprocessor,” Proceedings of 2004 IEEE International Solid-State Circuits Conference; Sun Microsystems's UltraSPARC T2, as described in “Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors,” Sun BluePrints Online, Sun Microsystems, retrieved 2008 Jan. 9; and the MIPS MT, as described in “MIPS32 Architecture,” Imagination Technologies, retrieved 4 Jan. 2014.
Typical SMT-based processors have required each hardware thread to have its own set of registers and additional tracking logic at every stage of a pipeline within the SMT-based processor. This increases the size of hardware resources, specifically the thread tracking logic needed to implement the design of the SMT-based processor. The thread tracking logic employed by the SMT-based processor is required not only to trace the execution of a hardware thread but also to determine whether the hardware thread has completed execution. Because the SMT-based processor may employ a large number of actively executing hardware threads, the CPU caches and associated translation look-aside buffers (TLBs) need to be large enough to avoid hardware thread thrashing.
Although SMT technology may improve single-threaded performance, the above-identified control circuit complexity renders it difficult to apply SMT technology to embedded processors that require low power consumption.
To overcome SMT control circuit complexity and reduce power consumption, other forms of multi-threading technologies have been developed. Block multi-threading and interleaved multi-threading have been proposed. Unfortunately, block multi-threading technology has been restricted to microcontrollers and other low-performance processors. Interleaved multi-threading technology has simplified control circuitry, but performance suffers when there are fewer software threads than available hardware threads in the processor. This technology has been promoted in certain high-performance low-power processors. A representative example of token-triggered multi-threading technology is described in U.S. Pat. No. 6,842,848.
Token-triggered multi-threading employs time sharing. Each software thread of execution is granted permission by the processor to execute in accordance with its own assigned clock cycles. Only one software thread per clock cycle is permitted to issue instructions. A token is employed to inform a software thread as to whether the software thread should issue an instruction in the next clock cycle. This further simplifies hardware thread logic. No software thread may issue a second instruction until all software threads have issued an instruction. If a software thread has no instruction available to issue, a no operation (NOP) is issued by the hardware thread. Processor hardware ensures that each software thread has the same instruction execution time. The result of an operation may be completed within a specified guaranteed period of time (e.g., a number of clock cycles). Accordingly, no instruction execution related inspection and bypass hardware is needed in the processor design.
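The issue rule described above can be illustrated with a minimal sketch. It assumes a fixed round-robin token order and one issue slot per clock cycle; the function name `token_triggered_issue`, the per-thread instruction queues, and the trace format are hypothetical and are not taken from the cited patent.

```python
from typing import List

NOP = "NOP"  # placeholder issued when the token holder has nothing ready


def token_triggered_issue(queues: List[List[str]], cycles: int) -> List[str]:
    """Return the per-cycle issue trace for a token passed round-robin.

    queues[i] holds the instructions waiting on hardware thread i
    (an illustrative model, not the hardware's actual structure).
    """
    if not queues:
        raise ValueError("at least one hardware thread is required")
    pending = [list(q) for q in queues]  # copy so the caller's lists are untouched
    num_threads = len(pending)
    trace = []
    for cycle in range(cycles):
        holder = cycle % num_threads  # only the token holder may issue this cycle
        if pending[holder]:
            trace.append(f"T{holder}:{pending[holder].pop(0)}")
        else:
            # No instruction available: the hardware thread issues a NOP,
            # and the slot cannot be given to another thread.
            trace.append(f"T{holder}:{NOP}")
    return trace


if __name__ == "__main__":
    # Thread 1 has no software thread assigned, so its slots become NOPs.
    print(token_triggered_issue([["add", "mul"], [], ["load"]], cycles=6))
```

In this sketch, thread 0 cannot issue its second instruction until the token has visited every other thread, which mirrors the rule that no software thread may issue a second instruction until all software threads have issued one.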
Token-triggered multi-threading technology simplifies the hardware issue logic of a multi-threaded processor and, accordingly, may achieve high performance with very little power consumption. However, compared with SMT technologies, the performance improvement of a token-triggered multi-threading processor is limited if there are fewer software threads having executable instructions during a clock cycle than available hardware threads. In such circumstances, hardware threads that do not have software threads assigned to them must issue NOPs.
Further, in order to avoid interference between software threads and to simplify the hardware structure, conventional token-triggered multi-threading employs a time-sharing strategy that can cause a low number of instructions to be executed per cycle. This reduces the processing speed of a single-threaded operation. For example, if the software instruction for context T1 is not in the cache and requires a reload from external memory, T1 has to wait many cycles for the reload because the external memory is slow. Even if context T0 has an instruction ready, it cannot issue that instruction at clock cycle C1: because of the structural limitations of the time-shared datapath, clock cycle C1 can only be used by context T1, and in this case the hardware thread must issue a NOP.
In the worst case of a single software thread of execution, the performance of a corresponding conventional token-triggered processor is 1/T of its peak (where T is the number of hardware threads). In a 10-threaded token-triggered processor running at 1 GHz, the performance of the processor is effectively reduced to 100 MHz.
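The 1/T figure can be checked with a short calculation. The sketch below assumes one issue slot per clock cycle and that every active software thread always has an instruction ready in its slot; the function name `effective_issue_rate_hz` and its parameters are hypothetical.

```python
def effective_issue_rate_hz(clock_hz: int, hardware_threads: int,
                            active_threads: int) -> int:
    """Useful (non-NOP) issue slots per second under strict time sharing.

    Each hardware thread owns one slot in every `hardware_threads` cycles,
    so only the slots owned by active threads carry real instructions.
    """
    if hardware_threads <= 0:
        raise ValueError("hardware_threads must be positive")
    if not 0 <= active_threads <= hardware_threads:
        raise ValueError("active_threads must be between 0 and hardware_threads")
    return clock_hz * active_threads // hardware_threads


if __name__ == "__main__":
    # Worst case from the text: 10 hardware threads at 1 GHz, one software thread.
    print(effective_issue_rate_hz(1_000_000_000, 10, 1))  # 100000000, i.e. 100 MHz
```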
To avoid thrashing and simplify the tracking circuit between hardware threads, each hardware thread in the Sandblaster 2.0 processor has its own separate instruction memory, as described in “The Sandblaster 2.0 Architecture and SB3500 Implementation,” Proceedings of the Software Defined Radio Technical Forum (SDR Forum '08), Washington, D.C., October 2008. Unfortunately, the individual instruction memories cannot be shared between hardware threads. This may result in underutilized memory resources in addition to reduced performance when the number of software threads is fewer than the number of hardware threads.