In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the “throughput”) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor. E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant clock speed improvements by shrinking and combining components, eventually packaging the entire processor as an integrated circuit on a single chip. Increasing clock speed through further size reduction and other improvements continues to be a goal.
In addition to increasing clock speeds, it is possible to increase the throughput of an individual CPU or a system by increasing the average number of operations executed per clock cycle. Modern computer systems are designed to perform many operations concurrently, in order to increase the average number of operations executed in a given time. Parallelism of various types is a common technique for boosting system throughput. For example, the reduced size and cost of individual processors has made it feasible, indeed common, to provide multiple CPUs operating in parallel in a single computer system.
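The relationship between clock speed, operations per cycle, and throughput described above can be illustrated with a minimal sketch. The function name and the numeric values are hypothetical and chosen only for illustration; real throughput depends on many factors this crude model ignores.

```python
def throughput(clock_hz: float, ops_per_cycle: float) -> float:
    """Crude throughput model: operations per second equals
    clock cycles per second times average operations per cycle."""
    return clock_hz * ops_per_cycle


# Hypothetical baseline: 1 GHz clock, one operation per cycle on average.
base = throughput(1.0e9, 1.0)

# Doubling the clock speed doubles throughput ...
faster_clock = throughput(2.0e9, 1.0)

# ... and so does doubling the average operations per cycle
# (for example, through some form of parallelism).
more_parallel = throughput(1.0e9, 2.0)
```

Under this simplified model the two routes to higher throughput are interchangeable, which is why parallelism is attractive when further clock speed increases are difficult to obtain.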
One particular form of parallelism in computer design is the use of hardware multithreading within a computer processor. The term “multithreading”, as used in the context of processor design, is not the same as the software use of the term, and for this reason the phrase “hardware multithreading” is often used to distinguish multithreading in the context of processor design from “multithreading” or “multiprogramming” in the context of software. The software use of the term means that a single process or task is subdivided into multiple related threads, which are capable of being dispatched independently for execution. Hardware multithreading involves the concurrent execution of multiple software threads within a single processor. These threads may represent completely independent tasks which are unrelated to one another. As used herein, the term “multithreading” refers to hardware multithreading, unless otherwise qualified.
A processor which supports hardware multithreading can support multiple active threads at any instant in time. I.e., the dispatcher in the operating system can dispatch multiple threads to the same processor concurrently. From the perspective of the operating system, it appears that there are multiple processors, each executing a respective thread. There are multiple approaches to hardware multithreading. In a more traditional form, sometimes called “fine-grained multithreading”, the processor executes N threads concurrently by interleaving execution on a cycle-by-cycle basis. This creates a gap between the execution of each instruction within a single thread, which tends to reduce the effect of waiting for certain short term latency events, such as waiting for a pipeline operation to complete. In a second form of multithreading, sometimes called “coarse-grained multithreading”, multiple instructions of a single thread are executed exclusively until the processor encounters some longer term latency event, such as a cache miss, at which point the processor switches to another thread. In a third form of multithreading, herein referred to as “dynamic multithreading”, an instruction unit in the processor selects one or more instructions from among multiple threads for execution in each cycle according to current processor and thread state.
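The difference between the fine-grained and coarse-grained forms can be sketched as two toy scheduling functions. This is an illustrative model only, not a description of any actual processor: the function names, the thread labels, and the `miss_cycles` parameter are assumptions, one instruction is issued per cycle, and thread switch penalties are ignored. Dynamic multithreading is not modeled.

```python
def fine_grained(threads, cycles):
    """Issue one instruction per cycle, rotating through the
    threads in round-robin order on a cycle-by-cycle basis."""
    return [threads[c % len(threads)] for c in range(cycles)]


def coarse_grained(threads, cycles, miss_cycles):
    """Run one thread exclusively; after a cycle in which the running
    thread hits a long-latency event (hypothetically listed in
    miss_cycles), switch to the next thread."""
    order = []
    current = 0
    for c in range(cycles):
        order.append(threads[current])
        if c in miss_cycles:
            current = (current + 1) % len(threads)
    return order


fine = fine_grained(["A", "B"], 6)
coarse = coarse_grained(["A", "B"], 6, miss_cycles={2})
```

With two threads, `fine` alternates between A and B every cycle, whereas `coarse` stays on thread A until the assumed cache miss in cycle 2 and then runs thread B.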
Regardless of the type of hardware multithreading employed, all hardware multithreading tends to increase the productive utilization of certain processor resources because one or more active threads can exploit processor resources to execute instructions even while other threads are stalled, as for example, when waiting for a cache line to be filled. I.e., in a processor which supports only a single thread, some processing resource, such as a pipeline, may have to wait idle on any of numerous latency events. However, if multiple threads are active in the processor, the probability that the resource can be utilized is increased. Put another way, a multithreaded processor increases the average number of operations executed per clock cycle in comparison to a similar processor which supports only a single thread.
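The probability argument above can be made concrete with a simple model. It is assumed here, purely for illustration, that each thread is stalled in any given cycle with the same probability and independently of the other threads; real threads contend for shared resources and are not independent, so the figures are only indicative.

```python
def utilization(stall_prob: float, n_threads: int) -> float:
    """Probability that at least one of n_threads is ready to use a
    shared resource, assuming independent, identical stall behavior."""
    return 1.0 - stall_prob ** n_threads


# Hypothetical stall probability of 0.5 per thread per cycle.
single = utilization(0.5, 1)   # one thread
dual = utilization(0.5, 2)     # two threads
```

Under these assumptions a resource that is usable half the time with one thread becomes usable three quarters of the time with two.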
Typically, hardware multithreading involves replicating certain processor registers for each thread in order to independently maintain the states of multiple threads. For example, for a processor implementing a PowerPC™ architecture to perform multithreading, the processor must maintain N states to run N threads. Accordingly, the following are replicated N times: general purpose registers, floating point registers, condition registers, floating point status and control register, count register, link register, exception register, save/restore registers and special purpose registers. Additionally, certain special buffers, such as a segment lookaside buffer, can be replicated or each entry can be tagged with a thread number. Some branch prediction mechanisms, such as the correlation register and the return stack, should also be replicated. However, larger hardware structures such as caches and execution units are typically not replicated, and are shared by all threads.
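The split between replicated per-thread state and shared structures can be sketched as a data structure. The class and field names below are hypothetical, the register set is abbreviated (special purpose registers and branch prediction state are omitted), and a dictionary stands in for the shared cache; the sketch illustrates only the organization, not any actual hardware.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """Architected state replicated once per hardware thread."""
    gpr: list = field(default_factory=lambda: [0] * 32)    # general purpose registers
    fpr: list = field(default_factory=lambda: [0.0] * 32)  # floating point registers
    cr: int = 0      # condition register
    fpscr: int = 0   # floating point status and control register
    ctr: int = 0     # count register
    lr: int = 0      # link register
    xer: int = 0     # exception register
    srr0: int = 0    # save/restore register 0
    srr1: int = 0    # save/restore register 1


@dataclass
class MultithreadedCore:
    """N copies of the thread state, with one cache shared by all threads."""
    n_threads: int
    threads: list = field(init=False, default_factory=list)
    shared_cache: dict = field(default_factory=dict)

    def __post_init__(self):
        # Build an independent ThreadState for each hardware thread.
        self.threads = [ThreadState() for _ in range(self.n_threads)]


core = MultithreadedCore(n_threads=2)
core.threads[0].gpr[3] = 7   # writing thread 0's register leaves thread 1 untouched
```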
Thus, it can be seen that hardware multithreading involves replication of hardware in the form of additional registers and other structures needed to maintain state information. While the number of threads supported can vary, each thread requires additional hardware resource which must be justified by the increase in utilization of the shared hardware resources, such as execution units. The marginal improvement in utilization declines as more threads are added and the shared hardware resources become more fully utilized, while the cost of each additional thread is relatively constant. Therefore the number of threads supported in most hardware multithreading processors is relatively small, with two being a common number.
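The trade-off between declining marginal benefit and roughly constant per-thread cost can be illustrated as follows. The sketch assumes that each thread stalls independently with a fixed probability, and it expresses the hypothetical per-thread hardware cost in the same units as utilization so the two can be compared directly; both are simplifying assumptions, and all names and values are illustrative.

```python
def marginal_gain(stall_prob: float, n: int) -> float:
    """Extra utilization obtained by adding the n-th thread, assuming
    each thread stalls independently with probability stall_prob."""
    return (1.0 - stall_prob) * stall_prob ** (n - 1)


def best_thread_count(stall_prob: float, cost_per_thread: float,
                      max_threads: int = 8) -> int:
    """Add threads while the next thread's marginal gain still
    exceeds the (assumed constant) cost of supporting it."""
    n = 1
    while n < max_threads and marginal_gain(stall_prob, n + 1) > cost_per_thread:
        n += 1
    return n


# Hypothetical values: 0.5 stall probability, cost of 0.2 per extra thread.
second_thread_gain = marginal_gain(0.5, 2)
third_thread_gain = marginal_gain(0.5, 3)
chosen = best_thread_count(0.5, 0.2)
```

With these illustrative numbers the second thread is worth adding but the third is not, which is consistent with two being a common thread count.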
In many system architectures, certain threads representing interrupts and other special processes run at a high priority. A particular example is input/output (I/O) bound threads, i.e., threads which service I/O processes. Generally, these threads spend most of their time in a wait state waiting for I/O completion, and when they do execute, they execute often but only briefly, and do not require large hardware resources. When such a thread is waiting on an event and the event occurs, the operating system dispatcher often dispatches the thread immediately to a processor (due to its high priority), causing some currently executing thread to be pre-empted.
Although each I/O bound thread may execute only briefly when dispatched, the cumulative effect of numerous high-priority pre-emptions can reduce the efficiency of system operation. There is some overhead involved in pre-empting a currently executing thread, saving its state, and dispatching the I/O bound thread to the processor, and multiplied by many such events this becomes significant additional work. Additionally, a high priority thread has a tendency to flush the contents of cache, even when executing only briefly. I.e., it will fill the cache, and particularly the high-level cache nearest the processor, with data it requires, resulting in the removal of data needed by other threads.
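The cache effect described above can be demonstrated with a toy simulation. The sketch assumes a small, fully associative cache with least-recently-used replacement, which is a simplification and not a model of any particular processor; the class name, the capacity, and the address labels are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if addr in self.lines:
            self.lines.move_to_end(addr)     # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[addr] = True
        return False


def misses(cache, addrs):
    """Count the misses incurred by a sequence of accesses."""
    return sum(not cache.access(a) for a in addrs)


cache = LRUCache(capacity=4)
working_set = [("T", i) for i in range(4)]   # lines used by an ordinary thread

warmup = misses(cache, working_set)          # cold misses while filling the cache
steady = misses(cache, working_set)          # working set now resident
io_burst = misses(cache, [("IO", i) for i in range(4)])  # brief high-priority thread
after = misses(cache, working_set)           # ordinary thread resumes
```

In this simulation the ordinary thread runs without misses once its working set is resident, but after the brief high-priority burst every one of its lines has been evicted and must be fetched again.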
It is desirable to find improved techniques for processor operation and design which will avoid or mitigate some of the undesirable side effects of servicing such high-priority threads.