The present invention relates in general to an improved data processing system and in particular to an improved system and method for switching threads of execution when execution of a thread is stalled in the dispatch stage of a multithreaded pipelined processor and for flushing the stalled thread from earlier stages of the pipeline.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Computer processors actually perform very simple operations quickly, such as arithmetic, logical comparisons, and movement of data from one location to another. What is perceived by the user as a new or improved capability of a computer system, however, may actually be the machine performing the same simple operations at very high speeds. Continuing improvements to computer systems require that these processor systems be made ever faster.
One measure of the overall speed of a computer system, also called the throughput, is the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, particularly the clock speed of the processor: if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Computer processors that were constructed from discrete components years ago were made significantly faster by shrinking the size and reducing the number of components; eventually the entire processor was packaged as an integrated circuit on a single chip. The reduced size made it possible to increase the clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems still exists. Hardware designers have been able to obtain still further improvements in speed by greater integration, by further reducing the size of the circuits, and by other techniques. Designers, however, think that physical size reductions cannot continue indefinitely and there are limits to continually increasing processor clock speeds. Attention has therefore been directed to other approaches for further improvements in overall throughput of the computer system.
Without changing the clock speed, it is still possible to improve system speed by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this practical. The use of slave processors considerably improves system speed by off-loading work from the central processing unit (CPU) to the slave processor. For instance, slave processors routinely execute repetitive and single special purpose programs, such as input/output device communications and control. It is also possible for multiple CPUs to be placed in a single computer system, typically a host-based system which serves multiple users simultaneously. Each of the different CPUs can separately execute a different task on behalf of a different user, thus increasing the overall speed of the system to execute multiple tasks simultaneously.
Coordinating the execution and delivery of results of various functions among multiple CPUs is difficult. It is less so for slave I/O processors, because their functions are predefined and limited, but it is much more difficult to coordinate functions for multiple CPUs executing general purpose application programs. System designers often do not know the details of the programs in advance. Most application programs follow a single path or flow of steps performed by the processor. While it is sometimes possible to break up this single path into multiple parallel paths, a universally applicable method for doing so is still being researched. Generally, breaking a lengthy task into smaller tasks for parallel processing by multiple processors is done by a software engineer writing code on a case-by-case basis. This ad hoc approach is especially problematic for executing commercial transactions, which are not necessarily repetitive or predictable.
Thus, while multiple processors improve overall system performance, it is much more difficult to improve the speed at which a single task, such as an application program, executes. For a given CPU clock speed, it is possible to further increase the speed of the CPU, i.e., the number of operations executed per second, by increasing the average number of operations executed per clock cycle. A common architecture for high performance, single-chip microprocessors is the reduced instruction set computer (RISC) architecture, characterized by a small, simplified set of frequently used instructions for rapid execution: the simple operations performed quickly that were mentioned earlier. As semiconductor technology has advanced, the goal of RISC architecture has been to develop processors capable of executing one or more instructions on each clock cycle of the machine. Another approach to increase the average number of operations executed per clock cycle is to modify the hardware within the CPU. The corresponding throughput measure, clock cycles per instruction (CPI), is commonly used to characterize architectures for high performance processors.
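The relationship between clock speed, cycles per instruction, and operations per second can be illustrated with a short calculation. The following is a minimal sketch only; the function name and the numeric values are illustrative assumptions and do not describe any particular processor.

```python
def instructions_per_second(clock_hz: float, cycles_per_instruction: float) -> float:
    """Throughput = clock rate / average clock cycles per instruction (CPI)."""
    if cycles_per_instruction <= 0:
        raise ValueError("CPI must be positive")
    return clock_hz / cycles_per_instruction


# Hypothetical 1 GHz clock held fixed; only the CPI changes.
baseline = instructions_per_second(1e9, 2.0)  # two cycles per instruction
improved = instructions_per_second(1e9, 1.0)  # one cycle per instruction
```

With the clock speed held constant, halving the cycles per instruction doubles the number of instructions executed per second.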
Processor architectural concepts pioneered in high performance vector processors and mainframe computers of the 1970s, such as the CDC-6600 and Cray-1, are appearing in RISC microprocessors. Early RISC machines were very simple single-chip processors. As Very Large Scale Integrated (VLSI) technology improves, additional space becomes available on a semiconductor chip. Rather than increase the complexity of a processor architecture, most designers have decided to use the additional space to implement techniques to improve the execution of a single CPU. Two principal techniques utilized are on-chip caches and instruction pipelines. Cache memories store data that is frequently used near the processor and allow instruction execution to continue, in most cases, without waiting the full access time of a main memory. Some improvement has also been demonstrated with multiple execution units with hardware that speculatively looks ahead to find instructions to execute in parallel. Pipeline instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished.
The superscalar processor is an example of a pipeline processor. The performance of a conventional RISC processor can be further increased in the superscalar computer and the Very Long Instruction Word (VLIW) computer, both of which execute more than one instruction in parallel per processor cycle. In these architectures, multiple functional or execution units are connected in parallel to run multiple pipelines. The name superscalar implies that these processors are scalar processors capable of executing more than one instruction in each cycle. The elements of superscalar pipelined execution include an instruction fetch unit to fetch more than one instruction at a time from a cache memory, instruction decoding logic to determine if instructions are independent and can be executed simultaneously, and sufficient execution units to execute several instructions at one time. The execution units may also be pipelined, e.g., floating point adders or multipliers may have a cycle time for each execution stage that matches the cycle times for the fetch and decode stages.
In a superscalar architecture, instructions may be completed in order or out of order. In-order completion means no instruction can complete before all instructions dispatched ahead of it have been completed. Out-of-order completion means that an instruction is allowed to complete before all instructions ahead of it have been completed, as long as predefined rules are satisfied. Within a pipelined superscalar processor, instructions are first fetched, decoded, and then buffered. Instructions can be dispatched to execution units as resources and operands become available. Additionally, instructions can be fetched and dispatched speculatively based on predictions about branches taken. The result is a pool of instructions in varying stages of execution, none of which have completed by writing final results. As resources become available and branches are resolved, the instructions are "retired" in program order, thus preserving the appearance of a machine that executes the instructions in program order.
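The retirement behavior described above, in which instructions may finish execution out of order but are retired in program order, can be sketched as follows. This is a simplified software model under stated assumptions: the class name `RetireQueue`, its methods, and the instruction tags are hypothetical and are not taken from any actual processor design.

```python
from collections import deque


class RetireQueue:
    """Hypothetical model: tracks dispatched instructions in program order."""

    def __init__(self):
        self._entries = deque()  # oldest instruction at the left

    def dispatch(self, tag):
        self._entries.append({"tag": tag, "done": False})

    def finish(self, tag):
        """Mark an instruction as executed; this may happen in any order."""
        for entry in self._entries:
            if entry["tag"] == tag:
                entry["done"] = True
                return
        raise KeyError(tag)

    def retire(self):
        """Retire finished instructions, but only from the oldest end."""
        retired = []
        while self._entries and self._entries[0]["done"]:
            retired.append(self._entries.popleft()["tag"])
        return retired


queue = RetireQueue()
for tag in ("i0", "i1", "i2"):
    queue.dispatch(tag)

queue.finish("i2")       # youngest instruction finishes first
first = queue.retire()   # nothing retires: i0 is still pending
queue.finish("i0")
second = queue.retire()  # only i0 retires: i1 blocks i2
queue.finish("i1")
third = queue.retire()   # i1 and i2 retire together, in program order
```

In this model an instruction that has finished executing still waits behind every older unfinished instruction, which is what preserves the appearance of in-order execution.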
A superscalar processor tracks, or manages, instructions that have been speculatively executed typically utilizing a completion buffer. Each executed instruction in the buffer is associated with its results generally stored in renamed registers and any exception flags. A problem arises, however, when instructions are executed out of order and in particular when one of the instructions encounters an error condition. The processor must stop at the instruction that has an error because that instruction may affect subsequent instructions and the machine state.
For both in-order and out-of-order completion of instructions in superscalar systems, these pipelines stop and stall under certain circumstances. An instruction that is dependent upon the results of a previously dispatched instruction that has not yet completed may cause the pipeline to stall. For instance, instructions dependent on a load/store instruction in which the necessary data is not in the cache, i.e., a cache miss, cannot be completed until the data becomes available in the cache.
Another technique called hardware multithreading is to independently execute smaller sequences of instructions called threads or contexts in a single processor. When a CPU, for any of a number of reasons, stalls and cannot continue processing or executing one of these threads, the CPU switches to and executes another thread. The term "multithreading" as defined in the computer architecture community is not the same as the software use of the term in which one task is subdivided into multiple related threads. Software multithreading substantially involves the operating system which manipulates and saves data from registers to main memory and maintains the program order of related and dependent instructions before a thread switch can occur. Software multithreading does not require nor is it concerned with hardware multithreading and vice versa. Hardware multithreading manipulates hardware architected registers and execution units and pipelined processors to maintain the state of one or more independently executing sets of instructions, called threads, in the processor hardware. Threads could be derived from, for example, different tasks in a multitasking system, different threads compiled from a software multithreading system, or from different I/O processors. What makes hardware multithreading unique and different from all these systems, however, is that more than one thread is independently maintained in a processor's registers.
Hardware multithreading takes on a myriad of forms. Multithreading permits processors having either non-pipelined or pipelined architectures to do useful work on more than one thread in the processor's registers. One form of multithreading, sometimes referred to as coarse-grained multithreading, is to execute one thread until the executing thread experiences a long latency event, such as retrieving data and/or instructions from memory or a processor interrupt, at which point the processor switches to another thread. Fine-grained multithreading, on the other hand, interleaves or switches threads on a cycle-by-cycle basis. Simultaneous hardware multithreading maintains N threads, or N states, in parallel in the processor and simultaneously executes N threads in parallel. Replicating processor registers for each of N threads results in some of the following registers being replicated N times: general purpose registers, floating point registers, condition registers, floating point status and control registers, count registers, link registers, exception registers, save/restore registers, special purpose registers, etc. Special buffers, such as a segment lookaside buffer, may be replicated; if they are not, each entry can be tagged with the thread number and flushed on every thread switch. Some branch prediction mechanisms, e.g., the correlation register and the return stack, may also be replicated. Multithreading may also take on features of one or all of these forms, picking and choosing particular features for particular attributes. For instance, not all of the processor's features need be replicated for each thread: there may be some shared and some replicated registers, and there may be some separate parallel stages in the pipeline and other shared stages of the pipeline.
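The difference between the coarse-grained and fine-grained forms described above can be sketched as two thread-selection policies. This is an illustrative model only; the function names, the round-robin selection of the next thread, and the representation of long latency events as a set of cycle numbers are assumptions made for the example.

```python
def fine_grained(num_threads, cycles):
    """Switch threads on every cycle, in round-robin order."""
    return [cycle % num_threads for cycle in range(cycles)]


def coarse_grained(num_threads, cycles, long_latency_cycles):
    """Run one thread until it hits a long latency event, then switch.

    long_latency_cycles is a hypothetical input: the set of cycle numbers
    at which the currently running thread experiences such an event.
    """
    schedule = []
    current = 0
    for cycle in range(cycles):
        schedule.append(current)
        if cycle in long_latency_cycles:
            current = (current + 1) % num_threads  # assumed round-robin switch
    return schedule


fine = fine_grained(2, 4)            # [0, 1, 0, 1]
coarse = coarse_grained(2, 6, {2})   # [0, 0, 0, 1, 1, 1]
```

Each list gives the thread selected on each cycle: the fine-grained policy alternates every cycle, while the coarse-grained policy stays on thread 0 until the event at cycle 2.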
Fortunately, there is no need to replicate some of the larger functions of the processor such as: level one instruction cache, level one data cache, instruction buffer, store queue, instruction dispatcher, functional or execution units, pipelines, translation lookaside buffer (TLB), and branch history table. A problem exists, however, when there are separate or private queues or other resources and the processor has some shared pipeline stages: when an instruction stalls at the shared stage, not only does its own thread stall at that stage, but other threads in or after the shared stage are stalled as well and cannot execute.
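The blocking effect of a stalled instruction in a shared stage can be illustrated with a small model. This is a sketch under stated assumptions: the shared stage is modeled as a single in-order queue of (thread, instruction) pairs, and the function name and the `blocked_threads` parameter are hypothetical.

```python
from collections import deque


def dispatch_in_order(shared_stage, blocked_threads, cycles):
    """Dispatch from the head of a shared in-order stage for some cycles.

    Each entry is a (thread_id, instruction) pair. If the head entry belongs
    to a thread whose private resource is blocked, the head stalls and no
    entry behind it can move, regardless of which thread it belongs to.
    """
    queue = deque(shared_stage)
    dispatched = []
    for _ in range(cycles):
        if not queue:
            break
        thread_id, _instruction = queue[0]
        if thread_id in blocked_threads:
            continue  # head is stalled; everything behind it waits
        dispatched.append(queue.popleft())
    return dispatched


stage = [(0, "a0"), (1, "b0"), (1, "b1")]
stalled = dispatch_in_order(stage, blocked_threads={0}, cycles=3)
unblocked = dispatch_in_order(stage, blocked_threads=set(), cycles=3)
```

Here `stalled` is empty even though thread 1 has every resource it needs, because thread 0 occupies the head of the shared stage.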
There is thus a need in the industry to enhance simultaneous multithreading by efficiently flushing a stalled thread in shared and earlier pipeline stages of a multithread processor.
These needs and others that will become apparent to one skilled in the art are satisfied by a method to flush one of a plurality of threads in a processor pipeline of a multithreaded computer processor, comprising the steps of: (a) fetching a plurality of threads for simultaneous processing in the multithreaded processor having at least one shared pipeline stage; (b) recognizing a stalled instruction in the shared pipeline stage which prevents further processing of at least two threads present in the processor pipeline, in which the stalled instruction belongs to one of the threads present in the processor pipeline; (c) flushing all instructions of the one thread having the stalled instruction from the shared pipeline stage and all stages in the processor pipeline prior to the shared pipeline stage; and (d) processing another of the threads in the processor pipeline. The stalled instruction may be stalled because a private resource required by the stalled thread is blocked. The instruction may be stalled because a non-renamed register needed by the stalled instruction is blocked. The stalled instruction may be an instruction requiring synchronized load/storage operations and the operations are delayed. The stalled instruction may be a first instruction of a group of microcoded instructions.
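Step (c) of the method above can be sketched in software as follows. This is a minimal illustrative model, not the hardware mechanism itself: the `Pipeline` class, the three stage names, the representation of instructions as (thread, instruction) pairs, and the function name `dispatch_flush` are all assumptions made for the example, with the dispatch stage taken as the shared stage.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    # Stages listed from earliest to latest; each holds (thread_id, instr) pairs.
    fetch: list = field(default_factory=list)
    decode: list = field(default_factory=list)
    dispatch: list = field(default_factory=list)  # assumed shared stage


def dispatch_flush(pipe, stalled_thread):
    """Remove the stalled thread's instructions from dispatch and earlier stages.

    Returns the flushed entries so the thread can be refetched later.
    Instructions of other threads are left in place.
    """
    flushed = []
    for name in ("fetch", "decode", "dispatch"):
        stage = getattr(pipe, name)
        flushed += [entry for entry in stage if entry[0] == stalled_thread]
        setattr(pipe, name, [entry for entry in stage if entry[0] != stalled_thread])
    return flushed


pipe = Pipeline(
    fetch=[(0, "a2"), (1, "b2")],
    decode=[(0, "a1"), (1, "b1")],
    dispatch=[(0, "a0"), (1, "b0")],
)
flushed = dispatch_flush(pipe, stalled_thread=0)
```

After the call, only thread 1 remains in the three stages, so its instruction at the dispatch stage is no longer behind the stalled instruction of thread 0 and can be processed, corresponding to step (d).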
The method may further comprise determining if any other instructions associated with the thread having the stalled instruction have any other associated flush conditions. The method may further comprise determining if the stalled instruction of the one of a plurality of threads or the other instruction having an associated flush condition should be flushed first; and first flushing the other instruction having the associated flush condition, in accordance with the associated flush condition, because the other instruction is older.
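The prioritization described above, in which an older instruction with its own flush condition is flushed before the stalled instruction, can be sketched as a selection among pending flush requests. This is an assumption-laden illustration: the `FlushRequest` type, its `seq` field (a program-order sequence number in which a smaller value means older), and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlushRequest:
    thread_id: int
    seq: int      # hypothetical program-order number; smaller means older
    reason: str


def select_flush(dispatch_request, other_requests):
    """Choose which flush to perform first for the stalled thread.

    Considers the dispatch flush plus any other pending flush conditions of
    the same thread and returns the one for the oldest instruction. On a tie
    the dispatch flush is returned because it is listed first.
    """
    candidates = [dispatch_request] + [
        req for req in other_requests if req.thread_id == dispatch_request.thread_id
    ]
    return min(candidates, key=lambda req: req.seq)


stall = FlushRequest(thread_id=0, seq=12, reason="dispatch stall")
older = FlushRequest(thread_id=0, seq=7, reason="other flush condition")
unrelated = FlushRequest(thread_id=1, seq=3, reason="other flush condition")
chosen = select_flush(stall, [older, unrelated])
```

In this example the request for sequence number 7 is chosen, since it belongs to the same thread and is older than the stalled instruction; the request from thread 1 is ignored.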
The method also encompasses that the shared pipeline stage is a dispatch stage. Under these circumstances, the method may also comprise removing the instructions of the thread associated with the stalled instruction from a decode pipeline. The method may also comprise removing the instructions of the thread associated with the stalled instruction from an instruction buffer. The method may also comprise restarting the thread whose instructions were flushed from the dispatch stage and all stages in the processor pipeline prior to the dispatch stage. The step of restarting may be delayed until a condition which causes the stalled instruction to stall at the dispatch stage is resolved. The step of restarting the flushed thread may also be delayed until a number of processor cycles have passed.
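The two restart delays described above can be sketched as a calculation of the earliest cycle at which the flushed thread may be restarted. This is an illustrative model only: the function and parameter names are hypothetical, and combining both delays by taking the later of the two is an assumption of this example rather than something the text specifies.

```python
def restart_cycle(flush_cycle, resolved_at=None, min_delay=0):
    """Return the earliest cycle at which a flushed thread may restart.

    flush_cycle: cycle in which the dispatch flush occurred.
    resolved_at: cycle in which the stalling condition is resolved, if the
        restart waits for it (None means it does not wait).
    min_delay:   number of processor cycles that must pass before restart.
    """
    earliest = flush_cycle + min_delay
    if resolved_at is not None:
        earliest = max(earliest, resolved_at)  # assumed: later delay wins
    return earliest


wait_for_resource = restart_cycle(10, resolved_at=25)              # 25
wait_fixed_cycles = restart_cycle(10, min_delay=4)                 # 14
wait_for_both = restart_cycle(10, resolved_at=12, min_delay=4)     # 14
```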
The invention may further be considered a dispatch flush mechanism in a hardware multithreaded pipeline processor, the pipeline processor simultaneously processing more than one thread, the dispatch flush mechanism comprising: a fetch stage of the pipeline processor; a decode stage of the pipeline processor connected to the fetch stage; a dispatch stage of the pipeline processor connected to the decode stage; flush prioritize logic connected to the fetch stage, the decode stage, and the dispatch stage of the pipeline processor; an issue queue of the pipeline processor connected to the dispatch stage; a plurality of private resources dedicated to each of the threads connected to the issue queue; and thread select logic within the pipeline processor, wherein one of the plurality of instructions of one of the plurality of threads is prevented from dispatching into an issue queue because one of the plurality of private resources is unavailable and wherein the thread select logic selects all instructions in the dispatch stage, the decode stage, and the fetch stage belonging to the one of the plurality of threads and the flush prioritize logic issues a signal to remove all of those instructions.
The invention is also considered an apparatus to enhance processor efficiency, comprising: means to fetch instructions from a plurality of threads into a hardware multithreaded pipeline processor; means to distinguish the instructions into one of a plurality of threads; means to decode the instructions; means to determine if the instructions have sufficient private and shared resources for dispatching the instructions; means to dispatch the instructions; and means to remove all of the instructions of the one of the plurality of threads from the fetching means, the decoding means, and the dispatching means when the determining means determines that one of the instructions of the one of the plurality of threads does not have sufficient private resources for the dispatch means to dispatch the instruction.
The invention is also a computer processing system, comprising a central processing unit; a semiconductor memory unit attached to the central processing unit; at least one memory drive capable of having removable memory; a keyboard/pointing device controller attached to the central processing unit for attachment to a keyboard and/or a pointing device for a user to interact with the computer processing system; a plurality of adapters connected to the central processing unit to connect to at least one input/output device for purposes of communicating with other computers, networks, peripheral devices, and display devices; and a hardware multithreading pipelined processor within the central processing unit to process at least two independent threads of execution. The hardware multithreading pipelined processor comprises a fetch stage, a decode stage, and a dispatch stage; an instruction stall detector to detect when a resource required in the dispatch stage by an instruction of one of the threads is unavailable and hence the instruction is stalled in the dispatch stage; flush decode logic to determine if the thread having the stalled instruction has a previous flush condition; and a dispatch flush mechanism to flush the thread having the stalled instruction from the fetch stage, the decode stage, and the dispatch stage if no other previous flush condition exists or if the previous flush condition has a lower priority than the stalled instruction, so that the processor can process another of the independent threads of execution with the processor pipeline.