1. Field of the Invention
The present invention relates generally to the field of processor or computer design and operation. In one aspect, the present invention relates to pipelined instruction and data operations in a multithreaded processor.
2. Description of the Related Art
Computer systems are constructed of many components, typically including one or more processors that are connected for access to one or more memory devices (such as RAM) and secondary storage devices (such as hard disks and optical discs). For example, FIG. 1 is a diagram illustrating a computer system 10 with multiple memories. Generally, a processor 1 connects to a system bus 12. Also connected to the system bus 12 is a memory (e.g., 14). During processor operation, CPU 2 processes instructions and performs calculations. Data for the CPU operation is stored in and retrieved from memory using a memory controller 8 and cache memory, which holds recently or frequently used data or instructions for expedited retrieval by the CPU 2. Specifically, a first level (L1) cache 4 connects to the CPU 2, followed by a second level (L2) cache 6 connected to the L1 cache 4. The CPU 2 transfers information to the L2 cache 6 via the L1 cache 4. Such computer systems may be used in a variety of applications, including as a server 10 that is connected in a distributed network, such as Internet 9, enabling server 10 to communicate with clients A-X, 3, 5, 7.
Because processor clock frequency is increasing more quickly than memory speeds, there is an ever-increasing gap between processor speed and memory access speed. In fact, memory speeds have only been doubling every six years, which is one-third the rate of microprocessors. In many commercial computing applications, this speed gap results in a large percentage of time elapsing during pipeline stalling and idling, rather than in productive execution, due to cache misses and latency in accessing external caches or external memory following the cache misses. Stalling and idling are most detrimental, due to frequent cache misses, in database handling operations such as OLTP, DSS, data mining, financial forecasting, mechanical and electronic computer-aided design (MCAD/ECAD), web servers, data servers, and the like. Thus, although a processor may execute at high speed, much time is wasted while idly awaiting data.
One technique for reducing stalling and idling is hardware multithreading to achieve processor execution during otherwise idle cycles. FIGS. 2a and 2b show two timing diagrams illustrating an execution flow 22 in a single-thread processor and an execution flow 24 in a vertical multithread processor. Processing applications, such as database applications and network computing applications, spend a significant portion of execution time stalled awaiting memory servicing. This is illustrated in FIG. 2a, which depicts a highly schematic timing diagram showing execution flow 22 of a single-thread processor executing a database application. The areas within the execution flow 22 labeled as “C” correspond to periods of execution in which the single-thread processor core issues instructions. The areas within the execution flow 22 labeled as “M” correspond to time periods in which the single-thread processor core is stalled waiting for data or instructions from memory or an external cache. A typical single-thread processor executing a typical database application executes instructions about 25% of the time, with the remaining 75% of the time elapsed in a stalled condition. The 25% utilization rate exemplifies the inefficient usage of resources by a single-thread processor.
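The utilization figure above can be reproduced with a short calculation. The following is a minimal sketch only: the function name utilization, the list flow_22, and the specific phase lengths are illustrative assumptions chosen to match the 25%/75% split described for execution flow 22, not measured data.

```python
def utilization(phases):
    """Return the fraction of cycles spent issuing instructions.

    `phases` is a list of (kind, cycles) pairs, where kind is "C" for a
    period in which the core issues instructions and "M" for a period in
    which the core is stalled on memory or an external cache.
    """
    busy = sum(cycles for kind, cycles in phases if kind == "C")
    total = sum(cycles for _, cycles in phases)
    return busy / total


# Hypothetical phase lengths for a single-thread flow such as flow 22.
flow_22 = [("C", 25), ("M", 75), ("C", 25), ("M", 75)]
print(utilization(flow_22))
```

With these assumed phase lengths the core is busy for 50 of 200 cycles, which corresponds to the 25% utilization rate discussed above.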
FIG. 2b is a highly schematic timing diagram showing execution flow 24 of similar database operations by a multithread processor. Applications, such as database applications, have a large amount of inherent parallelism due to the heavy throughput orientation of database applications and the common database functionality of processing several independent transactions at one time. The basic concept of exploiting multithread functionality involves using processor resources efficiently when a thread is stalled by executing other threads while the stalled thread remains stalled. The execution flow 24 depicts a first thread 25, a second thread 26, a third thread 27 and a fourth thread 28, all of which are labeled to show the execution (C) and stalled or memory (M) phases. As one thread stalls, for example first thread 25, another thread, such as second thread 26, switches into execution on the otherwise unused or idle pipeline. There may also be idle times (not shown) when all threads are stalled. Overall processor utilization is significantly improved by multithreading. The illustrative technique of multithreading employs replication of architected registers for each thread and is called “vertical multithreading.”
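The effect of switching to another thread when one thread stalls can be illustrated with a small cycle-by-cycle model. This is a sketch under stated assumptions, not a description of any particular processor: the function simulate and its parameters are hypothetical, every thread is assumed to repeat a fixed pattern of 25 execution cycles followed by 75 stalled cycles, and thread switches are assumed to cost no cycles.

```python
def simulate(num_threads, compute=25, stall=75, cycles=1000):
    """Return pipeline utilization for a switch-on-stall model.

    Each hypothetical thread repeats `compute` cycles of execution (C)
    followed by `stall` cycles waiting on memory (M). Only one thread
    can use the pipeline in a given cycle. Switching is assumed free.
    """
    remaining = [compute] * num_threads  # C cycles left in the current burst
    ready_at = [0] * num_threads         # cycle at which each thread's stall ends
    current = 0
    busy = 0
    for cycle in range(cycles):
        # Prefer the current thread; otherwise take the next ready thread.
        for offset in range(num_threads):
            t = (current + offset) % num_threads
            if ready_at[t] <= cycle:
                current = t
                break
        else:
            continue  # all threads are stalled, so the pipeline idles
        busy += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # The burst ends and the thread stalls on memory.
            remaining[current] = compute
            ready_at[current] = cycle + 1 + stall
    return busy / cycles


for n in (1, 2, 4):
    print(n, simulate(n))
```

Under these idealized assumptions, one thread keeps the pipeline busy 25% of the time, two threads 50%, and four threads fill the pipeline completely. Real workloads have irregular phase lengths, so idle times when all threads are stalled can still occur.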
Vertical multithreading is advantageous in processing applications in which frequent cache misses result in heavy clock penalties. When cache misses cause a first thread to stall, vertical multithreading permits a second thread to execute when the processor would otherwise remain idle. The second thread thus takes over execution of the pipeline. A context switch from the first thread to the second thread involves saving the useful states of the first thread and assigning new states to the second thread. When the first thread restarts after stalling, the saved states are returned and the first thread proceeds in execution. Vertical multithreading imposes costs on a processor in resources used for saving and restoring thread states, and may involve replication of some processor resources, for example replication of architected registers, for each thread. In addition, vertical multithreading can overwhelm the instruction fetching capabilities of a pipelined processor when a load miss forces a related instruction in a thread to be re-fetched. In particular, many processors speculate that a load request will hit in the cache in order to minimize bubbles in the pipeline. If the load misses in the cache, a flush is typically required in order to ensure the correct update of architectural state. For a single-thread machine, it is very likely that the instruction after the load will be in the pipeline when the load miss is detected. If that instruction is in the pipeline, then the instructions past the load must be flushed, and some mechanism for replaying the instruction after the load is required. When a replay mechanism is used to re-fetch the instruction after the load, the overall fetch bandwidth of the processor is reduced, since the same instructions must be fetched again.
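The fetch bandwidth cost of flush and replay can be illustrated by counting fetches. The following is a simplified sketch: the function count_fetches, the opcode names, and the parameter depth (the number of younger instructions assumed to be in the pipeline when a miss is detected) are all hypothetical, and the model ignores misses by loads that are themselves being re-fetched.

```python
def count_fetches(program, missing_loads, depth):
    """Count instruction fetches when load misses trigger a flush and replay.

    `program` is a list of opcode strings, `missing_loads` is a set of
    indices of loads that miss in the cache, and `depth` is the assumed
    number of instructions after the load already in the pipeline when
    the miss is detected. Those instructions are flushed and fetched again.
    """
    fetches = 0
    for index, opcode in enumerate(program):
        fetches += 1  # every instruction is fetched at least once
        if opcode == "load" and index in missing_loads:
            younger = len(program) - 1 - index
            fetches += min(depth, younger)  # flushed instructions are re-fetched
    return fetches


# Hypothetical six-instruction sequence in which both loads miss.
program = ["load", "add", "sub", "load", "mul", "add"]
total = count_fetches(program, missing_loads={0, 3}, depth=2)
print(total, len(program) / total)
```

In this assumed example, six instructions require ten fetches, so only 60% of the fetch bandwidth delivers new instructions. The loss grows with the miss rate and with the number of instructions in flight behind each missing load.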
Accordingly, an improved method and system for handling cache misses in a multithreaded processor are needed that are economical in resources and avoid costly overhead which reduces processor performance. In addition, an efficient processor protocol is needed that reduces or eliminates unnecessary pipeline flush operations to improve the overall fetch bandwidth of the processor. There is also a need for a method and system that efficiently handles first level (L1) data cache misses without imposing unneeded instruction re-fetch penalties, especially for use in highly threaded processor applications. Further limitations and disadvantages of conventional systems will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.