The fundamental structure of a modern computer includes peripheral devices to communicate information to and from the outside world; such peripheral devices may be keyboards, monitors, tape drives, communication lines coupled to a network, etc. Also included in the basic structure of the computer is the hardware necessary to receive, process, and deliver this information from and to the outside world, including busses, memory units, input/output (I/O) controllers, storage devices, and at least one central processing unit (CPU), etc. The CPU is the brain of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, all systems, from the earliest to the most modern, operate in fundamentally the same manner. Processors actually perform very simple operations quickly, such as arithmetic, logical comparisons, and movement of data from one location to another. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system, however, may actually be the machine performing the same simple operations, but much faster. Therefore, continuing improvements to computer systems require that these systems be made ever faster.
One measure of the overall speed of a computer system, also called the "throughput", is the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, particularly the clock speed of the processor. If everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Computer processors which were constructed from discrete components years ago were made significantly faster by shrinking the size and reducing the number of components; eventually the entire processor was packaged as an integrated circuit on a single chip. The reduced size made it possible to increase the clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems still exists. Hardware designers have been able to obtain still further improvements in speed by greater integration, by further reducing the size of the circuits, and by other techniques. Designers, however, think that physical size reductions cannot continue indefinitely and that there are limits to continually increasing processor clock speeds. Attention has, therefore, been directed to other approaches for further improvements in overall speed of the computer system.
Without changing the clock speed, it is still possible to improve system speed by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this practical. The use of slave processors considerably improves system speed by off-loading work from the CPU to the slave processor. For instance, slave processors routinely execute repetitive and single special purpose programs, such as input/output device communications and control. It is also possible for multiple CPUs to be placed in a single computer system, typically a host-based system which services multiple users simultaneously. Each of the different CPUs can separately execute a different task on behalf of a different user, thus increasing the overall speed of the system to execute multiple tasks simultaneously.
It is much more difficult, however, to improve the speed at which a single task, such as an application program, executes. Coordinating the execution and delivery of results of various functions among multiple CPUs is a tricky business. For slave I/O processors, this is not so difficult because the functions are pre-defined and limited, but for multiple CPUs executing general purpose application programs, it is much more difficult to coordinate functions because, in part, system designers do not know the details of the programs in advance. Most application programs follow a single path or flow of steps performed by the processor. While it is sometimes possible to break up this single path into multiple parallel paths, a universally applicable method for doing so is still being researched. Generally, breaking a lengthy task into smaller tasks for parallel processing is done by a software engineer writing code on a case-by-case basis. This ad hoc approach is especially problematic for executing commercial transactions, which are not necessarily repetitive or predictable.
Thus, while multiple processors improve overall system performance, there are still many reasons to improve the speed of the individual CPU. If the CPU clock speed is given, it is possible to further increase the speed of the CPU, i.e., the number of operations executed per second, by increasing the average number of operations executed per clock cycle. Recent advances towards this aim have yielded the superscalar computer which typically executes up to four instructions per processor clock cycle. A Very Long Instruction Word (VLIW) computer may execute sixteen instructions or more per processor cycle.
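The relationship described above can be sketched numerically: for a fixed clock speed, operations per second is the product of the clock rate and the average number of operations completed per cycle. The function name and the 200 MHz clock below are assumptions chosen only for illustration; the per-cycle figures of four and sixteen are the peak values mentioned above, so the results are upper bounds rather than sustained rates.

```python
def operations_per_second(clock_hz: float, avg_ops_per_cycle: float) -> float:
    """Throughput = clock rate x average operations completed per cycle."""
    if clock_hz <= 0 or avg_ops_per_cycle < 0:
        raise ValueError("clock must be positive and ops per cycle non-negative")
    return clock_hz * avg_ops_per_cycle


CLOCK_HZ = 200e6  # hypothetical clock rate, not taken from the text

scalar = operations_per_second(CLOCK_HZ, 1)             # one operation per cycle
superscalar_peak = operations_per_second(CLOCK_HZ, 4)   # "up to four" per cycle
vliw_peak = operations_per_second(CLOCK_HZ, 16)         # "sixteen or more" per cycle

for name, value in [("scalar", scalar),
                    ("superscalar (peak)", superscalar_peak),
                    ("VLIW (peak)", vliw_peak)]:
    print(f"{name}: {value:.3g} operations per second")
```

With the clock held constant, the only term left to improve is the average number of operations per cycle, which is the motivation for the techniques discussed next.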
Various processor hardware design techniques have been used to increase the average number of operations executed per clock cycle. These have involved, for example, the use of pipelines, multiple execution units with look ahead hardware for finding instructions to execute in parallel, etc. Limited improvement is possible using these techniques, but the hardware support required is often massive. Another technique to improve the efficiency of hardware within the CPU is to divide a processing task into segments called threads. This technique is related to breaking a larger task into smaller tasks for independent execution by different processors except here the threads are to be executed by the same processor. When a CPU then, for any of a number of reasons, cannot continue the processing or execution of one of these threads, the CPU switches to and executes another thread.
The CPU is actually an arrangement of integrated circuits, including at least one instruction control unit and an arithmetic and logic unit, that interprets and executes computer instructions. Within the CPU there is a processor core containing specialized functional units, each of which performs primitive operations, such as sequencing instructions, executing operations involving integers, executing operations involving real numbers, and transferring values between addressable storage and logical register arrays; these are the simple operations discussed earlier. Processor cores may have many of these specialized functional units either to achieve higher performance under peak requirements, or because the computer architecture requires more functional units. A single instruction or multiple instructions may dispatch operations to more than one functional unit in a single cycle of the processor's clock. In actuality, however, peak performance is rarely demanded; the duty cycle over time of any one functional unit is less than one hundred percent of the available clock cycles. Hence, there is idle time.
As discussed earlier, those parallel and sequential sets of instructions that can execute separately are called "threads of control" or, simply, "threads." A processor which has the capability to concurrently maintain more than one path of execution within a computer is called a multithreaded processor. The multithreaded processor usually has at least one backup register which has the data for a second thread while a first thread is executing. Commutation, also called context switch or thread switch, refers to the process of switching data and the state of certain registers associated with a particular thread out of one register set so that data and other state information associated with another thread can be switched into the first register set for execution. In a processor that doesn't support multithreading, however, context switch would swap data from one common set of operating registers with data in memory locations. The state of a thread includes all information necessary for a thread to execute.
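The difference between the two kinds of context switch described above can be illustrated with a toy model. This is a minimal sketch, not a description of any particular processor: the class names, the four-register state, and the `memory_words_moved` counter are all hypothetical. The first class has one common set of operating registers and swaps state with memory; the second keeps a backup register set for a second thread and exchanges it with the active set.

```python
from dataclasses import dataclass, field

NUM_REGS = 4  # hypothetical register count, kept small for readability


def blank_registers() -> list:
    return [0] * NUM_REGS


@dataclass
class SingleContextCore:
    """One set of operating registers; a context switch goes through memory."""
    regs: list = field(default_factory=blank_registers)
    memory: dict = field(default_factory=dict)  # thread id -> saved register values
    current: int = 0
    memory_words_moved: int = 0

    def context_switch(self, next_thread: int) -> None:
        self.memory[self.current] = list(self.regs)  # save outgoing state
        self.regs = list(self.memory.get(next_thread, blank_registers()))  # load incoming
        self.memory_words_moved += 2 * NUM_REGS
        self.current = next_thread


@dataclass
class MultithreadedCore:
    """An active register set plus a backup set holding a second thread."""
    active: list = field(default_factory=blank_registers)
    backup: list = field(default_factory=blank_registers)
    memory_words_moved: int = 0

    def thread_switch(self) -> None:
        # State is exchanged between register sets; memory is not involved.
        self.active, self.backup = self.backup, self.active


single = SingleContextCore()
single.regs[0] = 11        # thread 0 writes a value
single.context_switch(1)   # thread 0 state goes out to memory
single.regs[0] = 22        # thread 1 writes a value
single.context_switch(0)   # thread 0 state comes back from memory

multi = MultithreadedCore()
multi.active[0] = 11       # first thread writes a value
multi.thread_switch()
multi.active[0] = 22       # second thread writes a value
multi.thread_switch()

print(single.memory_words_moved, multi.memory_words_moved)
```

In both models each thread finds its own state intact after the switches; the difference is that only the single-context model pays for memory traffic on every switch.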
The particular events that trigger the commutation of execution resources from one thread to another and the frequency of commutating threads are determined by the processor architecture and implementation. One of the events which may trigger a context or thread switch in a multithreaded system is an explicit call to a centralized executive program, such as an operating system, to execute another task. In this case, the state of the first task in the operating registers must be saved before the state of the second or called-to task is brought into the operating registers. In such multi-tasking systems, where the executive program calls for a thread switch, commutation may be so infrequent that hardware exclusively dedicated to rapid commutation is superfluous.
Occasionally an instruction stream requires information, either data from storage or a subsequent instruction, that is not available because, for instance, the value or the instruction may not be immediately accessible to the processor. The thread then is unable to continue execution. A multithreaded system may then commutate control of the processor core, including the thread state registers, to another thread while waiting for distant memory or another processor or another functional unit to provide the desired value or instruction. Latency is the time, often measured in processor cycles, required for data and/or instructions from these other components of the computer system to become available to the processor. Latency can further increase if coherency is required for the storage of data and/or instructions across a memory hierarchy. Commutation, therefore, is especially beneficial if a storage reference implies coherent operations involving shared memory multiprocessors, such as non-uniform memory access (NUMA), or other processors' caches or memories. In some of these instances, hardware support for switching threads is imperative when the latency to complete a shared memory reference is well bounded, i.e., it is on the order of tens to hundreds of cycles, and the commutation is generally among only a few threads, e.g., two to four. The commutation frequency for well-bounded latency, based on statistics such as cache misses, is one-half to ten percent of the shared memory references missed.
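A simple analytical model suggests why commutating among only a few threads can mask a well-bounded latency. The sketch below is an assumption introduced for illustration and does not come from the text: each thread is taken to run for a fixed number of cycles, then wait a fixed latency, with an optional fixed switch cost. The function name and all numeric values are hypothetical.

```python
def utilization(threads: int, run_cycles: float, latency_cycles: float,
                switch_cycles: float = 0.0) -> float:
    """Fraction of cycles spent on useful work under a simplified model.

    Assumes every thread runs `run_cycles` between stalls, every stall lasts
    `latency_cycles`, and every thread switch costs `switch_cycles`.
    """
    if threads < 1 or run_cycles <= 0 or latency_cycles < 0 or switch_cycles < 0:
        raise ValueError("invalid model parameters")
    # Too few threads: the core idles while all threads are waiting.
    linear = threads * run_cycles / (run_cycles + switch_cycles + latency_cycles)
    # Enough threads: latency is fully hidden; only switch overhead remains.
    saturated = run_cycles / (run_cycles + switch_cycles)
    return min(linear, saturated)


# Hypothetical workload: 50 cycles of work between stalls, 100-cycle latency.
results = {n: utilization(n, run_cycles=50, latency_cycles=100) for n in (1, 2, 3, 4)}
for n, u in results.items():
    print(f"{n} thread(s): {u:.0%} of cycles doing useful work")
```

Under these assumed numbers, utilization rises from one third with a single thread to full utilization with three, and a fourth thread adds nothing. Real workloads have variable run lengths and latencies, so the model only indicates the trend.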
Hardware support is also necessary for commutating or switching threads in a cooperative parallel processing system where more than one thread, possibly running on distinct processors, is cooperating to complete a single task. A processor may need to commutate when a value shared among threads is not in the desired state because, for example, another processor is currently working with a shared memory location that the specific processor needs to reference, or the required data is being calculated by another thread in another processor, or the required data has not been verified yet, or the data is stale. Thus, commutation may be triggered by an explicit synchronization instruction which fails, e.g., compare and swap; or by a synchronization operation implicit in an instruction which references memory, such as in a hybrid-dataflow system. When instruction streams are more tightly coupled, such synchronization operations can occur at boundaries of tens to hundreds of instructions. Unlike the multi-tasking case above, however, the thread triggering commutation may need substantial time to allow computation of the results on which the commutation depends before it resumes. Other events which may trigger commutation of execution resources from one thread to another include expiration of a hardware timer, typically set for thousands to millions of clock cycles, or reference to an I/O device with latencies of thousands to millions of clock cycles. Thus, with frequent commutation, poorly bounded resumption, or limited space for a thread state in the processor, hardware support to transfer thread state to and from storage is imperative.
The hardware units in the processor core that actually transfer data between the processor core and a memory or other storage are called register/storage units. Data is explicitly transferred into a register/storage unit when an instruction "loads" data into a register within the processor core from memory in preparation for execution or when data is "saved", i.e., transferred from a register in the processor core to memory. Data can also be explicitly transferred by composite instructions such as "atomic save" or "restore" of thread state. An "atomic save" occurs when the data is stored to all visible memory in a multiprocessor system at once and no other stores or loads can impact a specified memory location until the atomic save has been completed. "Restoring" the thread state loads all registers in the processor core with data required by a processor thread state. On the other hand, data values are implicitly transferred by a register/storage unit when instructions utilize an addressing mode which references memory, e.g., add an immediate value of four to the value within a particular storage location and return the result to memory.
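The explicit and implicit transfers described above can be made concrete with a small software model. This is only a sketch under stated assumptions: the class `RegisterStorageUnit`, its method names, the dictionary used as memory, and the `transfers` counter are all hypothetical, and a software lock stands in for whatever mechanism the hardware would use to keep other loads and stores away during an atomic save.

```python
import threading


class RegisterStorageUnit:
    """Toy model of a unit that moves values between registers and memory."""

    def __init__(self, memory: dict):
        self.memory = memory             # address -> value
        self._lock = threading.Lock()    # stand-in for hardware atomicity
        self.transfers = 0               # words moved through this unit

    def load(self, registers: list, reg: int, address: int) -> None:
        """Explicit transfer: memory -> register."""
        with self._lock:
            registers[reg] = self.memory.get(address, 0)
            self.transfers += 1

    def save(self, registers: list, reg: int, address: int) -> None:
        """Explicit transfer: register -> memory."""
        with self._lock:
            self.memory[address] = registers[reg]
            self.transfers += 1

    def atomic_save(self, registers: list, base_address: int) -> None:
        """Store the whole thread state; no other access interleaves."""
        with self._lock:
            for offset, value in enumerate(registers):
                self.memory[base_address + offset] = value
            self.transfers += len(registers)

    def restore(self, registers: list, base_address: int) -> None:
        """Load every register of a thread state from memory."""
        with self._lock:
            for offset in range(len(registers)):
                registers[offset] = self.memory.get(base_address + offset, 0)
            self.transfers += len(registers)

    def add_immediate_to_memory(self, address: int, immediate: int) -> None:
        """Implicit transfer: a memory-operand add reads and writes storage."""
        with self._lock:
            self.memory[address] = self.memory.get(address, 0) + immediate
            self.transfers += 2          # one read and one write


memory = {100: 7}
unit = RegisterStorageUnit(memory)
regs = [0, 0, 0, 0]

unit.load(regs, 0, 100)                 # explicit load
unit.add_immediate_to_memory(100, 4)    # implicit: add four to a storage location
unit.atomic_save(regs, 200)             # composite save of the thread state

other_regs = [9, 9, 9, 9]
unit.restore(other_regs, 200)           # composite restore into another register set
```

Because every kind of transfer passes through the same unit in this model, a thread-state save or restore and an ordinary load or store compete for it, which is the conflict taken up next.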
A conflict between execution and commutation is created when the same register/storage unit is used for storage operations denoted by an instruction and there is a simultaneous request, whether implicit or explicit, to commutate a thread state between the processor core and storage. The conflict is easily resolved for systems using commutation to mask well-bounded latency, such as masking shared memory latency. In those systems and instances, it is usually sufficient to provide storage for several thread states within or close to the processor core. Commutation then need not traverse register/storage units.
For cooperative parallel processing systems, however, that commutate to mask coherency and synchronization delays, the number of threads unable to resume execution may become larger than the storage capacity within or close to the processor. The simplest solution is to suspend execution in order to commutate a thread state, but this limits scalability for cooperative parallel processing. With tens to hundreds of registers, a synchronization frequency of tens to hundreds of processor cycles, and several register/storage units, synchronization can impose severe performance penalties, up to fifty percent. Suspending a processor's execution, moreover, restricts the size of a thread and the parallelism gained by being able to execute multiple threads.
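A back-of-the-envelope model shows how a penalty of this magnitude can arise when execution is suspended to commutate. The model is an assumption made for illustration, not a derivation from the text: each register/storage unit is taken to move one register per cycle, a commutation moves one full thread state out and another in, and no useful work is done meanwhile. The function name and the sample values are hypothetical.

```python
import math


def commutation_penalty(registers: int, units: int, sync_interval_cycles: int) -> float:
    """Fraction of total cycles lost to suspended execution.

    Assumes one register per unit per cycle, a full state moved out and a
    full state moved in per commutation, and no overlap with execution.
    """
    if registers < 1 or units < 1 or sync_interval_cycles < 1:
        raise ValueError("all parameters must be positive")
    cycles_per_direction = math.ceil(registers / units)
    stall_cycles = 2 * cycles_per_direction      # save outgoing + restore incoming
    return stall_cycles / (sync_interval_cycles + stall_cycles)


# Hypothetical configurations within the ranges mentioned above.
heavy = commutation_penalty(registers=64, units=2, sync_interval_cycles=64)
light = commutation_penalty(registers=32, units=4, sync_interval_cycles=200)
print(f"heavy: {heavy:.0%} of cycles lost, light: {light:.1%} of cycles lost")
```

In the first assumed configuration the stall equals the useful work between synchronizations, so half of all cycles are lost; adding units or lengthening the interval between synchronizations shrinks the penalty quickly.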
An alternative solution to mask latency resulting from the coherency and synchronization requirement is to utilize additional register/storage units dedicated to commutation and buffering of state being commutated. As an example only, distinct hardware can provide for several threads: a thread state streaming out of the processor; a thread state actually executing; and a thread state streaming into the processor. Buffering these additional thread states reduces the performance impact of variances in the mean time between commutation. The RapidGraph parallel processor developed at Carnegie Mellon University and the Strand system from Philip's Laboratory are examples of such parallel machines. The static allocation of register/storage units to commutation, as in these machines, however, precludes those register/storage units from being used to meet peak requirements of either execution or commutation. Conceptually similar, the functional units within the processor core of the Cray 2 computer transfer thread states between a register file and memory. Specific subsets of the register state are statically assigned to commutate threads to memory whenever a particular functional unit is not explicitly assigned a load or store operation by an instruction stream. Likewise, U.S. Pat. No. 5,404,469 entitled, "Multi-threaded Microprocessor Architecture Utilizing Static Interleaving," to Chung et al. describes a processing system wherein the transfers of information between registers and memory are fixed and predetermined in time and in hardware. Because of all of these fixed and predetermined allocations, there can be no dynamic relationship between the register/storage units and the registers.