Multi-threaded parallel processing techniques are widely applied in the design of high-performance processors to reduce the impact of wait cycles during instruction execution and thereby improve the performance and operating efficiency of the processors. The most commonly used multi-threading technique is simultaneous multi-threading (SMT). For example, Intel's Hyper-Threading, IBM's POWER5, Sun Microsystems' UltraSPARC T2 and MIPS MT all employ the SMT technique.
With the SMT technique, not only is a separate set of registers required for the instruction execution of each thread, but thread tracking logic also has to be added, which increases the sizes of shared resources such as instruction caches and TLBs. The thread tracking logic keeps track of the progress of each thread and also checks whether the execution of the thread has been completed. Since a large number of threads may be in an execution state or semi-execution state, the caches and TLBs of the CPU must be large enough to avoid unwanted thrashing among the threads.
Though the SMT technique can improve the operational capability of the processor, it is difficult to use in the design of embedded processors and low-power processors, because it significantly increases the complexity of the hardware.
To overcome the complexity of SMT multi-threading control circuits and to reduce power consumption, a simplified time-shared multi-threading technique has been used. Time-shared multi-threading means that only one thread can operate in a specific instruction cycle. It can be categorized into block multi-threading and interleaved multi-threading. The block multi-threading technique is usually used for low-performance processors such as micro-controllers, because its contribution to the operating efficiency of the processor is very limited. The interleaved multi-threading technique has been applied to some extent to high-performance and low-power processors. Its control circuit is simple, yet it can attain higher operational capability and efficiency than a single-thread processor. A representative interleaved multi-threading technique is the token triggered multi-threading technique.
The token triggered interleaved multi-threading technique has the following features:
(1) It is a time-shared execution process. Each thread is executed in the clock cycles granted to the thread. Only one thread can issue instructions in a specific clock cycle.
(2) After a thread is executed, it will indicate which thread should be started in the next cycle. This approach greatly simplifies hardware selection for threads.
(3) The hardware ensures that each thread is provided with the same instruction execution time.
(4) The operation result can be obtained within specified cycles. Therefore, the instructions do not have to use dependency checking and bypass hardware.
FIG. 1 shows a timing sequence diagram of multi-threaded execution of a four-thread token triggered multi-threading mechanism.
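The time-shared behaviour described above can be illustrated with a minimal sketch. It is not taken from the figure: the function name token_schedule is hypothetical, and the sketch assumes that each thread simply passes the token to the next-numbered thread, which is only one possible token order.

```python
from typing import List


def token_schedule(num_threads: int, num_cycles: int, start: int = 0) -> List[int]:
    """Return, for each clock cycle, the id of the thread that holds the token.

    Assumption (not from the source): the running thread always hands the
    token to thread (id + 1) modulo num_threads, i.e. plain round-robin.
    """
    schedule = []
    token = start
    for _ in range(num_cycles):
        # Only the token holder may issue an instruction in this cycle.
        schedule.append(token)
        # The running thread indicates which thread starts in the next cycle.
        token = (token + 1) % num_threads
    return schedule


if __name__ == "__main__":
    # Four threads over eight cycles: each thread receives two cycles.
    for cycle, thread_id in enumerate(token_schedule(4, 8)):
        print(f"C{cycle}: T{thread_id}")
```

Under this assumption every thread receives the same number of cycles over any whole number of rounds, which corresponds to feature (3), and the owner of a cycle is known without any arbitration logic.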
The token triggered multi-threading technique contributes greatly to simplifying the multi-threading hardware structure and reducing power consumption, but it degrades the operating efficiency of the operating units of the processor, especially the processing efficiency for a single thread; consequently, the processing capacity of the processor is much lower than that of a processor that employs the SMT technique.
The token triggered multi-threading structure of Sandblaster 2.0 has the following drawbacks:
1. The time-shared sequential execution strategy, which is employed to prevent mutual interference among threads and to simplify the hardware structure, degrades the utilization of clock cycles and the processing capacity for a single thread. For example, suppose a thread T1 has to fetch an instruction from an external storage device because of an instruction miss. Since the external storage has a lower operating speed, the thread T1 may not be able to get the instruction in a timely manner. Meanwhile, a thread T0 has an instruction ready to be executed, but the clock cycle C1 can only be used by the thread T1 owing to structural constraints. In that case, the clock cycle C1 is wasted.
2. To avoid thrashing among threads and to simplify the tracking circuits, Sandblaster 2.0 is designed so that each thread has a separate instruction cache. The instruction caches cannot be shared among the threads, resulting in a significant waste of memory resources.
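The wasted cycle described in the first drawback can be reproduced with a small model. This is an illustrative sketch only: the names run_fixed_slots and utilization, the fixed mapping of cycle Ci to thread T(i mod N), and the chosen stall pattern are assumptions and do not describe the actual Sandblaster 2.0 hardware.

```python
from typing import Dict, List, Set


def run_fixed_slots(num_threads: int, num_cycles: int,
                    stalled: Dict[int, Set[int]]) -> List[str]:
    """Model strictly time-shared slots where a cycle belongs to one thread only.

    stalled maps a thread id to the set of cycles in which that thread has
    no instruction ready (for example, while waiting on external storage).
    """
    timeline = []
    for cycle in range(num_cycles):
        owner = cycle % num_threads  # assumed fixed slot ownership
        if cycle in stalled.get(owner, set()):
            # No other thread may use the slot, so the cycle is wasted.
            timeline.append("idle")
        else:
            timeline.append(f"T{owner}")
    return timeline


def utilization(timeline: List[str]) -> float:
    """Fraction of cycles in which some thread issued an instruction."""
    return sum(slot != "idle" for slot in timeline) / len(timeline)


if __name__ == "__main__":
    # T1 is waiting for an instruction during its slots in cycles 1 and 5.
    timeline = run_fixed_slots(4, 8, {1: {1, 5}})
    print(timeline)
    print(utilization(timeline))  # 6 of 8 cycles used, i.e. 0.75
```

In this model the cycles owned by T1 stay idle even though T0 has work available, so two of the eight cycles are lost.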