1. Technical Field of the Invention
The present invention relates in general to a method and apparatus for partitioning a processor register set to improve the performance of multi-threaded operations. More particularly, the present invention relates to a method and apparatus for retrofitting multi-threaded operations on a conventional computer architecture. Still more particularly, the present invention relates to a method and apparatus for partitioning the processor register set and managing the register subsets to improve multi-threading performance of a computer.
2. Description of Related Art
Single tasking operating systems have been available for many years. In single tasking operating systems, a computer processor executes computer programs or program subroutines serially. In other words, a computer program or program subroutine must be completely executed before execution of another program or subroutine can begin.
Single tasking operating systems are inefficient because the processor must wait during the execution of some steps. For example, some steps cause the processor to wait for a data resource to become available or for a synchronization condition to be met. To keep the processor busy and increase efficiency, multi-threaded operating systems were invented.
In multi-threaded operating systems, the compiler breaks a task into a plurality of threads. Each of the threads performs a specific task which may be executed independently of the other threads. Although the processor can execute only one thread at a time, if the thread being executed must wait for the occurrence of an external event such as the availability of a data resource or a synchronization event, then the processor switches threads. Although thread switching itself requires a few processor cycles, if the waiting time exceeds this switching time, then processor efficiency is increased.
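The break-even condition described above can be sketched as a short calculation. This is only an illustrative model: the function names and the cycle counts are hypothetical and do not come from any particular processor.

```python
def cycles_saved_by_switching(wait_cycles: int, switch_cycles: int) -> int:
    """Net processor cycles recovered by switching threads instead of stalling.

    A positive result means the switch pays for itself; zero or a negative
    result means the processor would do no worse by simply waiting.
    """
    return wait_cycles - switch_cycles


def should_switch(wait_cycles: int, switch_cycles: int) -> bool:
    """Switch only when the expected wait exceeds the cost of the switch."""
    return cycles_saved_by_switching(wait_cycles, switch_cycles) > 0


# Hypothetical numbers: a 100-cycle wait against a 5-cycle switch.
print(cycles_saved_by_switching(100, 5))  # 95
print(should_switch(3, 5))                # False
```

Under these assumed figures, a long wait justifies the switch, while a wait shorter than the switching time does not.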
Accessing internal state, for example on-chip processor registers, generally requires fewer processor clock cycles than accessing external state, for example cache or memory. Increasing the number of registers inside the processor generally decreases the probability of external accesses to cache or memory. In other words, to decrease the number of external memory requests, the prior art generally increases the number of processor registers.
For example, the latest generations of instruction set architectures, including RISC (Reduced Instruction Set Computers) and VLIW (Very Long Instruction Word) processors, typically improve execution of a single task by increasing the number of registers. Such processors often have 64 to 256 registers capable of retaining integer and/or floating point values.
Computer system architectures and programming trends are moving toward multi-threaded operations rather than single, sequential tasks. To multithread an operation, each task is decomposed by the compiler into more than one thread. Because threads tend to run for much shorter intervals before being completed than a single large task, threads tend to have a smaller associated state per thread. In other words, each thread of a multithreaded operation tends to require fewer associated registers than a single large task, which generally requires a large number of registers to execute.
Threads typically are allowed to run until a thread switch event occurs. A thread switch event occurs, for example, when a referenced memory location is not found in the cache or a program-defined synchronization condition is not met. For example, when an L2 cache miss occurs, the main memory must be accessed, which is very time consuming. Instead of waiting, the processor switches threads.
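The run-until-switch-event behavior can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions: the event names `cache_miss` and `sync_wait`, the `schedule` function, and the round-robin policy are all hypothetical, and a suspended thread is assumed to be ready again by the time its turn comes back around.

```python
from collections import deque

# Hypothetical names for the thread switch events described in the text.
SWITCH_EVENTS = {"cache_miss", "sync_wait"}


def schedule(threads: dict[str, list[str]]) -> list[str]:
    """Return thread ids in the order they are given the processor.

    Each thread is a list of steps. A thread runs until it finishes or
    reaches a switch event, at which point the next thread is selected.
    """
    ready = deque((tid, deque(steps)) for tid, steps in threads.items())
    order = []
    while ready:
        tid, steps = ready.popleft()
        order.append(tid)
        while steps:
            step = steps.popleft()
            if step in SWITCH_EVENTS:
                break  # thread switch event: give up the processor
        if steps:
            # Simplification: the thread rejoins the queue immediately.
            ready.append((tid, steps))
    return order


print(schedule({"A": ["op", "cache_miss", "op"], "B": ["op", "op"]}))
# ['A', 'B', 'A']
```

Thread A is suspended at its cache miss, thread B runs to completion in the meantime, and thread A then resumes.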
When a thread is suspended due to a thread switch event, its inactive or NOT READY state may be retained within the processor registers. In the prior art, however, if a given thread does not resume execution within a few thread commutations, the finite register storage available within the processor leads to swapping of thread state between the processor and memory. In other words, the prior art swaps the entire thread context between the inactivated thread and the next thread to be processed.
Thread switching requires several processor cycles and directly competes for processor, bus and memory resources. Because the prior art switches the entire thread state upon a thread switch event, good multithreading performance dictates a reduced internal state or, in other words, a smaller number of registers within the processor.
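The relationship between per-thread state and swapping cost can be sketched with a simple model. Everything here is an assumption made for illustration: the `RegisterFileModel` class, the least-recently-run eviction policy, and the cost of one cycle per register saved or restored are hypothetical and are not taken from any prior art design.

```python
from collections import OrderedDict


class RegisterFileModel:
    """Toy model: whole thread contexts are swapped between registers and memory."""

    def __init__(self, total_registers, registers_per_thread, cycles_per_register=1):
        if registers_per_thread > total_registers:
            raise ValueError("a thread context must fit in the register file")
        self.capacity = total_registers // registers_per_thread  # resident contexts
        self.context_cost = registers_per_thread * cycles_per_register
        self.resident = OrderedDict()  # thread id -> None, least recently run first
        self.spilled = set()           # thread ids whose context is in memory
        self.swap_cycles = 0

    def switch_to(self, tid):
        if tid in self.resident:
            self.resident.move_to_end(tid)  # context still on chip: no swap
            return
        if len(self.resident) >= self.capacity:
            victim, _ = self.resident.popitem(last=False)
            self.spilled.add(victim)
            self.swap_cycles += self.context_cost  # save the evicted context
        if tid in self.spilled:
            self.spilled.remove(tid)
            self.swap_cycles += self.context_cost  # restore the saved context
        self.resident[tid] = None


# Same 64-register file, same thread sequence, two per-thread state sizes.
large, small = RegisterFileModel(64, 32), RegisterFileModel(64, 16)
for tid in ["A", "B", "C", "A"]:
    large.switch_to(tid)
    small.switch_to(tid)
print(large.swap_cycles, small.swap_cycles)  # 96 0
```

In this model, halving the per-thread state lets all three contexts stay resident, so the swap traffic disappears, which is the pressure toward fewer registers per thread that the text describes.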
Thus, there is a conflict between established processor instruction set architectures optimized for a single task, which require a large number of internal processor registers, and the demands of newer, multithreaded architectures and programming systems, which require relatively few internal processor registers for high-performance multithreading operations.
Furthermore, the computer industry has a tremendous investment in software and hardware embodying existing instruction set architectures. As a result, it is very difficult to successfully introduce hardware and software which embodies a new and incompatible instruction set architecture.
For example, adding hardware to duplicate the register set is a known technique for increasing multithreaded performance. In other words, the prior art duplicates the entire register set including special purpose registers and general purpose registers so that each thread has its own dedicated register set to facilitate thread switching. Register set duplication, however, greatly increases the circuit complexity and makes the circuit layout more difficult to implement.