An example of a multi-threaded processor is described in our U.S. Pat. No. 5,968,167. This discloses a processor which executes each of a plurality of threads in dependence on the availability of resources which each thread requires for it to execute. Selection between threads for execution is performed by a media control core or arbiter which determines which thread should execute and switches between threads as appropriate.
Such a multi-threaded processor will have a separate set of registers which store the program state for each of a number of programs or executing threads. When the resources required by one of the threads are not available, e.g. because it is waiting for a memory access, the thread is prevented from continuing and the processor switches to another thread which has all the resources it requires available and is therefore able to continue execution. The arbitration between threads is organised so that the processor is, whenever possible, executing useful instructions instead of idling, and the use of the processor is thereby optimised. When a thread is not executing, its set of registers stores its current state.
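The switching behaviour described above can be illustrated with a minimal sketch. The names `Thread` and `select_thread` are hypothetical, and the round-robin search order is an assumption made only for illustration; the arbitration policy of the referenced processor is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    name: str
    # Each thread keeps its own register set, so its state survives a switch.
    registers: list = field(default_factory=lambda: [0] * 16)
    # True while a required resource (e.g. a memory access) is unavailable.
    waiting: bool = False


def select_thread(threads, current):
    """Return the index of the thread to execute next.

    The current thread is kept if it can continue. Otherwise the next
    thread in round-robin order (an assumed policy) whose resources are
    all available is chosen. None is returned if every thread is stalled.
    """
    n = len(threads)
    for offset in range(n):
        idx = (current + offset) % n
        if not threads[idx].waiting:
            return idx
    return None
```

In this sketch the processor idles only when `select_thread` returns `None`, that is, when no thread has all the resources it requires.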
One factor which is critical in obtaining optimised usage of the processor is the time overhead required to swap execution between threads. If this overhead is similar to the time a thread spends waiting, for example for a memory access, then there is no net gain in processor efficiency in switching between executing threads. It has therefore been appreciated that fast swapping between threads is required to optimise processor efficiency. Fast thread swapping is helped by having separate sets of registers for the program states stored for each thread.
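The break-even point can be made concrete with a simple model. The function name and the cycle counts used with it are illustrative assumptions, not measured figures: a stall lasts `wait_cycles`, and the swap itself consumes `swap_cycles` during which no useful instruction executes.

```python
def cycles_gained_by_switching(wait_cycles, swap_cycles):
    """Useful cycles recovered by switching away from a stalled thread.

    Simplified model (an assumption): the swap overhead is subtracted
    from the stall duration, and the gain can never be negative because
    the processor could simply idle instead of switching.
    """
    return max(0, wait_cycles - swap_cycles)
```

Under this model a swap overhead equal to or greater than the waiting time yields no gain, while a swap overhead that is small relative to the waiting time recovers almost all of the stalled cycles.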
As discussed above, the state for an executing thread is stored in a set of registers. To get maximum performance from these registers it is common for them to be read at least twice and written to at least once within each clock cycle.
This results from the structure of machine code instructions. An example is an “ADD” instruction. This takes the contents of two source registers, adds them, and then stores the result back in the register store. In order for this to be executed in one clock cycle, the register store requires two read ports and one write port: the two read ports provide the two pieces of data on which the summation is to be performed, and the write port enables the result to be written back to the register store. The problem with this is that as the number of ports on a register store is increased, the area of silicon required to produce the store increases significantly and, as a result, the speed of operation reduces. The cost of the device also increases.
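The port requirement of a single-cycle ADD can be sketched as follows. `RegisterFile` and `add` are hypothetical names, and counting port usage per cycle in software is only an assumed way of modelling the hardware constraint.

```python
class RegisterFile:
    """Register store that allows a limited number of accesses per cycle."""

    def __init__(self, num_registers=16, read_ports=2, write_ports=1):
        self.regs = [0] * num_registers
        self.read_ports = read_ports
        self.write_ports = write_ports
        self.reads_this_cycle = 0
        self.writes_this_cycle = 0

    def read(self, index):
        # Each read in a cycle occupies one read port.
        if self.reads_this_cycle >= self.read_ports:
            raise RuntimeError("no free read port in this cycle")
        self.reads_this_cycle += 1
        return self.regs[index]

    def write(self, index, value):
        # Each write in a cycle occupies one write port.
        if self.writes_this_cycle >= self.write_ports:
            raise RuntimeError("no free write port in this cycle")
        self.writes_this_cycle += 1
        self.regs[index] = value

    def next_cycle(self):
        # All ports become free again at the start of the next cycle.
        self.reads_this_cycle = 0
        self.writes_this_cycle = 0


def add(rf, dest, src_a, src_b):
    """Execute ADD within one cycle: two reads, then one write."""
    rf.write(dest, rf.read(src_a) + rf.read(src_b))
```

With the default two read ports and one write port, `add` completes within a single cycle. Constructing the store with `read_ports=1` causes the second operand read to raise an error, which corresponds to the instruction needing more than one cycle.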
A multi-ported register store has to increase in depth in proportion to the number of threads which require the fast switching ability. For example, if a processor has sixteen registers and four threads are required to switch efficiently, then a register store of four times sixteen, i.e. sixty-four, registers is required, with sixteen registers per thread. Therefore, the silicon area required for the register store is a function of both the number of ports and the number of threads.
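The sizing arithmetic above can be written out as a short sketch. The function names and the contiguous per-thread layout in `physical_index` are assumptions for illustration; the text only fixes the total depth.

```python
def total_registers(registers_per_thread, num_threads):
    """Depth of a register store holding a full register set per thread."""
    return registers_per_thread * num_threads


def physical_index(thread_id, reg, registers_per_thread=16):
    """Map (thread, register) to a location in the combined store.

    Assumes each thread's registers occupy one contiguous block.
    """
    return thread_id * registers_per_thread + reg
```

For sixteen registers and four threads, `total_registers(16, 4)` gives 64, and the last register of the fourth thread, `physical_index(3, 15)`, maps to location 63.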