1. Field of the Invention
The present invention relates generally to the field of microprocessors, and, more particularly, to a banked shadowed register file for minimizing thread switch overhead in multithreaded processing applications so as to improve processor performance.
2. Description of the Related Art
Improving processor performance is a paramount goal in the competitive field of microprocessing. Significant advances have been made in this regard by increasing the speed at which processors operate. Although increasing processing speed is generally advantageous in terms of providing a larger number of clock cycles per unit of time, a drawback nonetheless exists in that the speed of conventional processors now far outpaces the speed at which memory-dependent operations can be performed. Depending upon the memory level and operation, this can result in significant memory latencies. The resulting latencies cause pipeline stalls, which disadvantageously restrict processor throughput.
Multithreading is a processing technique designed to minimize the adverse effect of pipeline stalling on processor performance. As used herein, the term "thread" is defined as including an individual software program or independent sub-program generated when a full program is compiled. Threads are quite often dependent upon long latency operations, such as those associated with instruction fetches, cache misses, unresolved data dependencies, and branch latencies. These long latency operations typically cause the execution core of the processor to stall and remain idle for many cycles. Multithreading circumvents these idle cycles by reassigning the execution resources of the processor to one or more new threads when a currently executing thread stalls waiting for dependent operations. In this fashion, multithreading advantageously improves processor performance by hiding latencies with the performance of useful work cycles.
Multithreading may take one of three general forms. Coarse grained multithreading is characterized as having the processor support only one active thread at a time by admitting instructions from only one thread into the execution pipeline. Fine grained multithreading is characterized as having the processor support multiple active threads while issuing instructions from only one thread during a given clock cycle. Simultaneous multithreading is characterized as having the processor issue and execute instructions from multiple threads during each clock cycle. In each instance, multithreading makes efficient use of the processor during clock cycles that would otherwise be wasted due to latencies.
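By way of illustration only, the three forms described above may be contrasted by considering which threads are permitted to issue instructions in a given clock cycle. The following is a minimal sketch and not a description of any particular processor. The function names, the round-robin selection used for the fine grained case, and the `width` parameter are hypothetical and are introduced solely for purposes of the example.

```python
from typing import List, Set


def coarse_grained(current: int, ready: Set[int]) -> List[int]:
    """One thread owns the pipeline; it issues only if it is not stalled.

    A stall of the current thread yields an idle cycle until a thread
    switch installs another thread (the switch itself is not modelled here).
    """
    return [current] if current in ready else []


def fine_grained(cycle: int, ready: Set[int]) -> List[int]:
    """Several threads are active, but only one issues per clock cycle.

    Round-robin selection among ready threads is an assumption of this sketch.
    """
    order = sorted(ready)
    return [order[cycle % len(order)]] if order else []


def simultaneous(ready: Set[int], width: int) -> List[int]:
    """Up to `width` ready threads issue in the same clock cycle."""
    return sorted(ready)[:width]
```

In this sketch, `coarse_grained(0, {1, 2})` returns an empty list because the single active thread is stalled, whereas `fine_grained` and `simultaneous` still select from the threads that are ready in that cycle.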
Multithreading is particularly advantageous in server applications in that it can inexpensively boost throughput and do so without the need for multiple processors. The demand for inexpensive servers has increased rapidly in the recent past due, in part, to the proliferation of the Internet. To meet this demand, various servers have been designed with multiple processors for improving throughput. However, due to the dramatic drop in the cost of memory and disk storage, providing multiple processors now represents a significant portion of the total cost of these servers. Multithreading overcomes this by providing the ability to simultaneously handle the individual tasks for a multitude of different users without requiring multiple processors. This is particularly advantageous in server applications such as on-line transaction processing, which may spend up to 30% of the processing time waiting for main memory to return data to the processor.
Conventional microprocessors are single-threaded in that they provide only one set of architectural registers, namely, a register file for maintaining a thread's architectural state during execution. As such, conventional processors are best suited for coarse grained multithreading since this type of multithreading requires supporting only one thread at a time. However, before another thread can begin, the current thread's state must be saved in memory so it can properly resume later. This process, referred to as a "thread switch," involves flushing the pipeline of instructions from the current thread, saving the thread's architectural state, and providing instructions from the new thread to the processor. The amount of time required to complete the thread switch process is referred to as "thread switch overhead." Depending upon the number of registers and cache misses incurred, it may take a conventional processor hundreds of clock cycles to complete a thread switch. Coarse grained multithreading using conventional processors is, therefore, only worthwhile when the memory latency to be avoided is sufficiently greater than the thread switch overhead.
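The break-even condition stated at the end of the preceding paragraph may be illustrated with a simple calculation. The following is a minimal sketch under a deliberately simplified model: it assumes that another thread is ready to run, that the execution core is idle for the duration of the thread switch, and that the cost of switching back is ignored. The function names and the cycle counts used in the example are hypothetical.

```python
def cycles_recovered(stall_latency: int, switch_overhead: int) -> int:
    """Cycles of useful work gained by switching threads during one stall.

    Assumed model: without a switch the core idles for `stall_latency`
    cycles; with a switch it idles for `switch_overhead` cycles and then
    runs another thread for the remainder of the stall.
    """
    return max(0, stall_latency - switch_overhead)


def switch_is_worthwhile(stall_latency: int, switch_overhead: int) -> bool:
    """A switch pays off only if the stall outlasts the switch overhead."""
    return cycles_recovered(stall_latency, switch_overhead) > 0
```

For example, under these assumptions a 500-cycle stall with a 200-cycle thread switch overhead recovers 300 cycles, whereas a 100-cycle stall with the same overhead recovers none, so the switch would not be worthwhile.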
One prior art technique for reducing thread switch overhead involves providing each thread with its own set of architectural registers. This approach suffers a significant drawback, however, in that adding extra register files on the processor requires establishing a direct connection between the processor's execution core and each newly added register file. With the advances in integrated circuit manufacturing, the space consumed by the direct connections between the extra register files and the execution core cuts significantly into the total number of transistors that can be provided on the processor. As such, adding a separate register file for each thread is not cost effective in that it consumes a substantial amount of space in order to couple each register file to the execution core of the processor.
What is needed, therefore, is an apparatus and method for reducing thread switch overhead in a coarse grained multithreaded application which effectively improves processor efficiency while consuming negligible space on the processor.