1. Field of the Invention
The present invention relates generally to the field of processor or computer design and operation. In one aspect, the present invention relates to memory operations in a multithreaded processor.
2. Description of the Related Art
Computer systems are constructed of many components, typically including one or more processors that are connected for access to one or more memory devices (such as RAM) and secondary storage devices (such as hard disks and optical discs). For example, FIG. 1 is a diagram illustrating a computer system 10 with multiple memories. Generally, a processor 1 connects to a system bus 12. Also connected to the system bus 12 is a memory (e.g., 14). During processor operation, CPU 2 processes instructions and performs calculations. Data for the CPU operation is stored in and retrieved from memory using a memory controller 8 and cache memory, which holds recently or frequently used data or instructions for expedited retrieval by the CPU 2. Specifically, a first level (L1) cache 4 connects to the CPU 2, followed by a second level (L2) cache 6 connected to the L1 cache 4. The CPU 2 transfers information to the L2 cache 6 via the L1 cache 4. Such computer systems may be used in a variety of applications, including as a server 10 that is connected in a distributed network, such as Internet 9, enabling server 10 to communicate with clients A-X, 3, 5, 7.
Because processor clock frequency is increasing more quickly than memory speeds, there is an ever-increasing gap between processor speed and memory access speed. In fact, memory speeds have been doubling only every six years, one-third the rate of microprocessors. In many commercial computing applications, this speed gap results in a large percentage of time elapsing during pipeline stalling and idling, rather than in productive execution, due to cache misses and latency in accessing external caches or external memory following the cache misses. Stalling and idling are most detrimental, due to frequent cache misses, in database handling operations such as OLTP, DSS, data mining, financial forecasting, mechanical and electronic computer-aided design (MCAD/ECAD), web servers, data servers, and the like. Thus, although a processor may execute at high speed, much time is wasted while idly awaiting data.
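By way of illustration only, the growth of the speed gap can be worked out with a few lines of arithmetic. The sketch below assumes that "one-third the rate" means microprocessor speed doubles every two years while memory speed doubles every six years; the function name and default parameters are hypothetical and are used only for this example.

```python
def speed_gap(years, cpu_doubling_years=2.0, mem_doubling_years=6.0):
    """Ratio of processor speed-up to memory speed-up after `years`.

    Assumption (not stated numerically in the text): processors double
    every 2 years, i.e. three times as often as the 6-year memory doubling.
    """
    cpu_speedup = 2.0 ** (years / cpu_doubling_years)
    mem_speedup = 2.0 ** (years / mem_doubling_years)
    return cpu_speedup / mem_speedup


if __name__ == "__main__":
    for years in (6, 12, 18):
        print(years, "years: gap grows by a factor of", speed_gap(years))
```

Under these assumed doubling periods, the gap itself doubles every three years, so that the relative cost of each cache miss keeps growing even as both components become faster.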
One technique for reducing stalling and idling is hardware multithreading to achieve processor execution during otherwise idle cycles. FIGS. 2a and 2b show two timing diagrams illustrating an execution flow 22 in a single-thread processor and an execution flow 24 in a vertical multithread processor. Processing applications, such as database applications and network computing applications, spend a significant portion of execution time stalled awaiting memory servicing. This is illustrated in FIG. 2a, which depicts a highly schematic timing diagram showing execution flow 22 of a single-thread processor executing a database application. The areas within the execution flow 22 labeled as “C” correspond to periods of execution in which the single-thread processor core issues instructions. The areas within the execution flow 22 labeled as “M” correspond to time periods in which the single-thread processor core is stalled waiting for data or instructions from memory or an external cache. A typical single-thread processor executing a typical database application executes instructions about 25% of the time with the remaining 75% of the time elapsed in a stalled condition. The 25% utilization rate exemplifies the inefficient usage of resources by a single-thread processor.
FIG. 2b is a highly schematic timing diagram showing execution flow 24 of similar database operations by a multithread processor. Applications, such as database applications, have a large amount of inherent parallelism due to the heavy throughput orientation of database applications and the common database functionality of processing several independent transactions at one time. The basic concept of exploiting multithread functionality involves using processor resources efficiently by executing other threads while a stalled thread remains stalled. The execution flow 24 depicts a first thread 25, a second thread 26, a third thread 27 and a fourth thread 28, all of which are labeled to show the execution (C) and stalled or memory (M) phases. As one thread stalls, for example first thread 25, another thread, such as second thread 26, switches into execution on the otherwise unused or idle pipeline. There may also be idle times (not shown) when all threads are stalled. Overall processor utilization is significantly improved by multithreading. The illustrative technique of multithreading employs replication of architected registers for each thread and is called “vertical multithreading.”
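The effect of switching threads on a stall can be illustrated with a small cycle-level model. The following is a minimal sketch, not a description of any particular processor: it assumes a single pipeline, a zero-cost thread switch, and threads that each alternate one execution (C) cycle with three stalled (M) cycles, chosen to match the 25%/75% split described above. The function name and parameters are hypothetical.

```python
def pipeline_utilization(num_threads, compute_cycles=1, stall_cycles=3,
                         total_cycles=1200):
    """Fraction of cycles in which the single pipeline issues instructions.

    Switch-on-stall model: a thread runs until it stalls, then the
    lowest-numbered ready thread takes over the pipeline.
    """
    ready_at = [0] * num_threads           # cycle at which each thread's stall ends
    left = [compute_cycles] * num_threads  # compute cycles left before next stall
    current = None                         # thread that owns the pipeline, if any
    busy = 0
    for cycle in range(total_cycles):
        if current is None:
            ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
            if not ready:
                continue                   # all threads stalled: idle cycle
            current = ready[0]
        busy += 1                          # the pipeline issues this cycle
        left[current] -= 1
        if left[current] == 0:             # thread misses and stalls on memory
            ready_at[current] = cycle + 1 + stall_cycles
            left[current] = compute_cycles
            current = None
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "thread(s):", pipeline_utilization(n))
```

In this idealized model, one thread keeps the pipeline busy a quarter of the time and four threads keep it fully busy. Real workloads have irregular stall lengths and nonzero switch costs, so the idle periods noted above, when all threads are stalled, would still occur.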
Vertical multithreading is advantageous in processing applications in which frequent cache misses result in heavy clock penalties. When cache misses cause a first thread to stall, vertical multithreading permits a second thread to execute when the processor would otherwise remain idle. The second thread thus takes over execution of the pipeline. A context switch from the first thread to the second thread involves saving the useful states of the first thread and assigning new states to the second thread. When the first thread restarts after stalling, the saved states are returned and the first thread proceeds in execution. Vertical multithreading imposes costs on a processor in resources used for saving and restoring thread states, and may involve replication of some processor resources, for example replication of architected registers, for each thread. In addition, vertical multithreading can overwhelm the processor core and/or memory system as stores are generated more quickly by the processor pipeline than can be processed by the cache or memory system.
The use of store buffers in processors is a common technique to improve performance and handle store operations issued by the processor. By buffering stores to the cache or memory, a program can continue to execute while waiting for the stores to issue to the cache or memory. Without the buffer, if the program were waiting on a store, it would be unable to perform another store and execution would halt. When using a store buffer, care must be taken to prevent the buffer from overflowing because a buffer overflow can cause instructions to be lost. At a basic level, this requires that no store instructions be issued when the store buffer is full. In pipelined processor applications, management of the buffer is complicated by the fact that the store buffer commit point is typically later in the pipeline than the final stall point. As a result, there can be stores that are in the pipeline but not yet in the store buffer. Conventional store buffer solutions have provided a buffer count feedback from the buffer to the front of the pipeline, though such solutions can cause wiring congestion, create timing problems and force pipeline flushes. In particular, the use of a buffer count feedback signal requires multiple wires for each thread in the store buffer. Not only must additional time be provided to receive and process the buffer count feedback signal, but the circuit size and cost are increased. When the store buffer is located on the physical die at a distance away from the pipeline front end (e.g., the instruction fetch unit circuitry), the timing layout requirements are aggravated. Another conventional solution is to implement a high water mark strategy, though this generally results in low buffer utilization because it assumes all instructions in the pipeline are stores.
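The low utilization of the high water mark strategy can be seen in a simple model. The sketch below is illustrative only and rests on several assumptions that are not taken from the text: an 8-entry buffer, four pipeline stages between the final stall point and the store buffer commit point, a stall decision made in the same cycle as the commit, and a memory system that never drains the buffer (the worst case for overflow). All names are hypothetical.

```python
from collections import deque


def peak_occupancy(instrs, capacity=8, stages_after_stall=4, cycles=64):
    """Peak store buffer occupancy under a high water mark stall policy.

    The front end stalls once occupancy reaches the high water mark, which
    reserves one entry for every in-flight slot past the stall point, as if
    every instruction in those slots were a store.
    """
    high_water_mark = capacity - stages_after_stall
    in_flight = deque([None] * stages_after_stall)  # slots between stall and commit
    pending = deque(instrs)
    occupancy = 0   # buffer entries in use (never drained in this model)
    peak = 0
    for _ in range(cycles):
        oldest = in_flight.popleft()        # instruction reaching the commit point
        if oldest == "st":
            if occupancy >= capacity:
                raise OverflowError("store buffer overflow")
            occupancy += 1
        if pending and occupancy < high_water_mark:
            in_flight.append(pending.popleft())   # issue the next instruction
        else:
            in_flight.append(None)                # stalled: insert a bubble
        peak = max(peak, occupancy)
    return peak


if __name__ == "__main__":
    print("all stores:     ", peak_occupancy(["st"] * 20))
    print("one store in 4: ", peak_occupancy(["st", "alu", "alu", "alu"] * 10))
```

In this model the buffer never overflows, but with a typical instruction mix in which only one instruction in four is a store, occupancy never rises above the high water mark, so half of the assumed eight entries are never used. Only a stream consisting entirely of stores approaches the full capacity, which is the case the strategy is sized for.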
Accordingly, improved memory operations for multithreading and/or multiprocessor circuits and operating methods are needed that are economical in resources and avoid costly overhead which reduces processor performance. In addition, an efficient store buffer protocol is needed that maximizes the use of store buffer entries while keeping communication between the store buffers and the front end of the pipeline to a minimum. There is also a need for a store buffer method and system that efficiently processes store buffer entries without requiring elaborate feedback techniques to prevent overflow, especially for use in highly threaded processor applications where the number of stores is increased. An improved store buffer management system and methodology is needed that minimizes the store buffer size, operation time and power usage. Further limitations and disadvantages of conventional systems will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.