1. Technical Field
The present invention relates generally to advanced computer architectures. More specifically, the present invention provides a multithreaded processor architecture that aims at simplifying the programming of concurrent activities for memory latency hiding and multiprocessing without sacrificing performance.
2. Description of the Related Art
Multithreaded architectures (also referred to as multiple-context architectures) use hardware-supported concurrency to hide the latency associated with remote load and store operations. In this context, it is important to understand what is meant by “concurrency,” as the term may be easily confused with “parallelism.” In parallel execution, multiple instructions are executed simultaneously. In concurrent execution, multiple streams of instructions, referred to here as threads, are maintained simultaneously, but it is not necessary for multiple individual instructions to be executed simultaneously. To make an analogy, if multiple workers in an office are working simultaneously, one could say that the workers are working in parallel. On the other hand, a single worker may maintain multiple projects concurrently, working a little on one, switching to another, then returning to the first one to pick up where he or she left off. As can be observed from this analogy, the term “concurrent” is broader in scope than “parallel.” All parallel systems support concurrent execution, but the reverse is not true.
Another useful analogy comes from the judicial system. A single judge may have many cases pending in his or her court at any given time. However, the judge will only conduct a hearing on a single case at a time. Thus, the judge presides over multiple cases in a concurrent manner. A single judge will not hear multiple cases in parallel, however.
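The distinction can be illustrated with a minimal sketch in Python, offered only as an illustration and not as part of any described architecture. The names project and run_concurrently are hypothetical. A single worker maintains two streams of work at once, yet only one step is ever executed at a time, which is concurrency without parallelism.

```python
from collections import deque


def project(name, steps):
    """A stream of work; each yield is a point where the worker may switch."""
    for i in range(1, steps + 1):
        yield f"{name}:step{i}"


def run_concurrently(tasks):
    """One worker interleaves several tasks, executing one step at a time."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))  # do a little work on this task
            ready.append(task)        # set it aside; it resumes where it left off
        except StopIteration:
            pass                      # task finished; drop it
    return trace


# Both projects are "in progress" together, but steps never overlap in time.
trace = run_concurrently([project("A", 2), project("B", 3)])
print(trace)
# ['A:step1', 'B:step1', 'A:step2', 'B:step2', 'B:step3']
```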
Multithreaded architectures provide hardware support for concurrency, but not necessarily for parallelism (although some multithreaded architectures do support parallel execution of threads). Supporting multiple concurrent threads of execution in a single processor makes memory latency hiding possible. The latency of an operation is the time delay between when the operation is initiated and when a result of the operation becomes available. Thus, in the case of a memory-read operation, the latency is the delay between the initiation of the read and the availability of the data. In certain circumstances, such as a cache miss, this latency can be substantial. Multithreading alleviates this problem by switching execution to a different thread if the current thread must wait for a reply from the memory module, thus attempting to keep the processor active at all times.
Returning to the previous office worker example, if our hypothetical office worker needs a piece of information from a co-worker who is not presently in the office, our office worker may decide to send the co-worker an e-mail message. Rather than sit idle by the computer to await a reply to the message (which would incur a performance or “productivity” penalty), the worker will generally switch to some other task to perform in the meantime, while waiting for the reply. This “hides” the latency, because the worker is still able to perform productive work on a continuous basis. Multithreaded architectures apply the same principle to memory latency hiding in processors.
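The effect of switching threads on a memory wait can be sketched with a small cycle-level simulation. This is a toy model under stated assumptions, not a description of any particular processor: each thread computes for run_cycles cycles, then issues a memory read that takes latency cycles; the processor executes one thread per cycle; and a context switch costs nothing. The function name utilization and all parameter values are hypothetical.

```python
def utilization(num_threads, run_cycles, latency, total_cycles=10_000):
    """Fraction of cycles in which the processor does useful work."""
    ready_at = [0] * num_threads            # first cycle each thread may run
    remaining = [run_cycles] * num_threads  # compute cycles left before next read
    busy = 0
    current = None
    for cycle in range(total_cycles):
        if current is None:
            # Switch to any thread whose memory reply has already arrived.
            for t in range(num_threads):
                if ready_at[t] <= cycle:
                    current = t
                    break
        if current is None:
            continue  # every thread is waiting on memory: an idle cycle
        busy += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # The thread issues a read and blocks until the reply arrives.
            ready_at[current] = cycle + 1 + latency
            remaining[current] = run_cycles
            current = None
    return busy / total_cycles


for n in (1, 2, 5):
    print(n, utilization(n, run_cycles=10, latency=40))
```

With these assumed values, one thread keeps the processor busy 20% of the time, two threads 40%, and five threads keep it fully busy, which matches the simple estimate min(1, N·R/(R+L)) for N threads, R compute cycles, and L cycles of latency. A real design would also have to account for the cost of the switch itself.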
In order to maintain multiple threads of execution, the current execution state, or context, of each thread must be maintained. Hence, the term “multithreaded architecture” is synonymous with the term “multiple context architecture.” The act of switching between different threads is thus known as context switching. Returning to the previous judge analogy, context information is like a docket: it describes the current state of a thread so that execution can be resumed from that state, just as a judge's docket tells the judge about what motions are outstanding, so that the judge knows what rulings will need to be made when the case comes on for hearing. In the case of a computer program, it is the processor state (for example: program counter, registers, and status flags) that makes up the context for a given thread.
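As a minimal sketch of what a context holds and what a context switch does, the following Python fragment models the processor state named above (program counter, registers, and status flags). The class names Context and Processor, the register count, and the flag names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Per-thread execution state (illustrative fields only)."""
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 4)
    flags: dict = field(default_factory=lambda: {"zero": False, "carry": False})


class Processor:
    def __init__(self):
        self.active = Context()  # state of the thread currently executing

    def switch_to(self, saved):
        """Install a saved context and return the outgoing one for later resumption."""
        outgoing = self.active
        self.active = saved
        return outgoing


cpu = Processor()
cpu.active.pc = 100
cpu.active.registers[0] = 7           # thread A has made some progress

thread_b = Context(pc=200)
thread_a = cpu.switch_to(thread_b)    # A is set aside, B runs
cpu.active.pc += 4                    # B executes one instruction

thread_b = cpu.switch_to(thread_a)    # A resumes exactly where it stopped
print(cpu.active.pc, cpu.active.registers[0], thread_b.pc)
# 100 7 204
```

Because each thread's state is kept whole and separate, resuming thread A restores its program counter and registers untouched by the work done on behalf of thread B.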
Multithreaded execution and context switching are commonly employed in software as part of a multitasking operating system, such as AIX (Advanced Interactive eXecutive), a product of International Business Machines Corporation of Armonk, NY. Software instructions are used to create and destroy threads, as well as to periodically switch between different threads' contexts. Multithreaded processors, on the other hand, provide built-in hardware support for thread creation/deletion and context switching.
The first multithreaded system on record was Gamma 60, designed and produced by Bull GmbH in Cologne (Köln) in the 1950's. Decades later, Burton Smith pioneered the use of multithreading for memory latency hiding in multiprocessors. He architected HEP in the late 1970's, later Horizon, and more recently Tera (described in U.S. Pat. No. 4,229,790 (GILLILAND et al.) Oct. 21, 1980). Threading models, such as the Threaded Abstract Machine (TAM), appeared in the late 1980's. Cilk, an algorithmic multithreaded programming language, appeared in the mid 1990's.
A number of existing patents are directed to multithreaded architectures. U.S. Pat. No. 5,499,349 (NIKHIL et al.) Mar. 12, 1996 and U.S. Pat. No. 5,560,029 (PAPADOPOULOS et al.) Sep. 24, 1996, both assigned to Massachusetts Institute of Technology, describe multithreaded processor architectures that utilize a continuation queue and fork and join instructions to support multithreading. U.S. Pat. No. 5,357,617 (DAVIS et al.) Oct. 18, 1994, assigned to International Business Machines Corporation, is another example of an existing multithreaded architecture design.
Another related technology is simultaneous multithreading (SMT), marketed by Intel as Hyper-Threading, which integrates multithreading with superscalar architecture and instruction-level parallelism (ILP). SMT, however, adds considerable design complexity and power consumption. U.S. Pat. No. 6,463,527 (VISHKIN) Oct. 8, 2002 is an example of such a multithreaded processor with ILP.
Some multithreaded processors are able to hide the latency associated with performing memory operations, such as loads and stores. However, other operations, such as arithmetic operations, still impose a substantial performance penalty due to the latencies of the different functional units used to perform those operations.
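The size of this penalty can be estimated with a short worked example. The latency figures below are hypothetical and chosen only for illustration, and the model assumes a single thread on a single-issue pipeline in which each operation needs the result of the one before it, so the next operation cannot issue until the previous one completes.

```python
# Hypothetical functional-unit latencies, in cycles (assumed values).
LATENCY = {"add": 1, "mul": 4, "div": 20}


def dependent_chain_cycles(ops):
    """Cycles to run a chain of operations where each depends on the previous result."""
    return sum(LATENCY[op] for op in ops)


ops = ["mul", "add", "div", "add"]
total = dependent_chain_cycles(ops)   # 4 + 1 + 20 + 1
stalled = total - len(ops)            # cycles in which nothing new can issue
print(total, stalled)
# 26 22
```

Under these assumptions only 4 of 26 cycles issue a new operation; the remaining 22 are spent waiting on functional units, even though no memory access is involved.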
What is needed, therefore, is a method and system for hiding the latency of non-memory-access operations in a multithreaded processor pipeline. The present invention provides a solution to this and other problems, and offers other advantages over previous solutions.