1. Field of the Invention
The present invention relates to the design of processors within computer systems. More specifically, the present invention relates to an efficient store queue architecture that holds pending stores and applies them to a memory subsystem in program order.
2. Related Art
Advances in semiconductor fabrication technology have given rise to dramatic increases in microprocessor clock speeds. This increase in microprocessor clock speeds has not been matched by a corresponding increase in memory access speeds. Hence, the disparity between microprocessor clock speeds and memory access speeds continues to grow, and is beginning to create significant performance problems. Execution profiles for fast microprocessor systems show that a large fraction of execution time is spent not within the microprocessor core, but within memory structures outside of the microprocessor core. This means that the microprocessor systems spend a large fraction of time waiting for memory references to complete instead of performing computational operations.
Efficient caching schemes can help reduce the number of memory accesses that are performed. However, when a memory reference, such as a load, generates a cache miss, the subsequent access to level-two (L2) cache or memory can require dozens or hundreds of clock cycles to complete, during which time the processor is typically idle, performing no useful work.
In contrast, cache misses during stores typically do not affect processor performance as much because the processor usually places the stores into a “store queue” and continues executing subsequent instructions. However, as computer system performance continues to increase, store queues need to become larger to accommodate relatively larger memory latencies.
Unfortunately, as store queues become larger, it is no longer practical to use conventional store queue designs. Conventional store queue designs typically maintain an array of stores in program order, and provide circuitry to match every incoming load against the array of stores. They also provide circuitry to produce the value of every byte being read from the value most recently written to that byte in the store queue, which may involve accessing entries for multiple stores. The above-described circuitry increases the complexity of the store queue, which becomes a problem as the store queue increases in size.
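The matching behavior described above can be illustrated in software. The following is a minimal sketch only, assuming byte-addressed stores and a dictionary standing in for the memory subsystem; the names `Store` and `forward_load` are hypothetical and do not come from any particular processor design. It shows why a single load may need to consult entries for multiple stores.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Store:
    addr: int    # address of the first byte written (assumed byte-addressed)
    data: bytes  # bytes written by this store


def forward_load(store_queue: List[Store], memory: Dict[int, int],
                 addr: int, size: int) -> bytes:
    """Return the bytes a load would observe.

    store_queue is held in program order (oldest first). For each byte
    of the load, the youngest store that wrote that byte supplies the
    value; bytes not covered by any pending store come from memory.
    """
    result = []
    for byte_addr in range(addr, addr + size):
        value = memory.get(byte_addr, 0)  # default when no store matches
        # Scan youngest-to-oldest so the last written value wins.
        for store in reversed(store_queue):
            offset = byte_addr - store.addr
            if 0 <= offset < len(store.data):
                value = store.data[offset]
                break
        result.append(value)
    return bytes(result)


# Two pending stores overlap the load; one byte comes from memory.
queue = [Store(0x100, b"\x11\x22\x33\x44"), Store(0x102, b"\xaa")]
memory = {0x104: 0x55}
loaded = forward_load(queue, memory, 0x101, 4)
```

In this sketch every byte of every load is compared against every queued store. Hardware performs those comparisons in parallel, so the comparison circuitry grows with the number of store queue entries.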
Some researchers have investigated two-level store queue implementations in which a larger, second-level store queue is implemented in RAM and is searched linearly whenever a Bloom filter indicates that a hit may be possible. For example, see [Akkary03] Akkary, Rajwar and Srinivasan, “Checkpoint Processing and Recovery: An Efficient, Scalable Alternative to Reorder Buffers,” IEEE Micro, vol. 23, no. 6, pp. 11-19, 2003. Although this two-level store queue is area-efficient, it is also very slow.
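The general idea of guarding a linear search with a Bloom filter can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation from the cited work: the class names, the filter size, the hash construction and the use of whole-address matching are all hypothetical choices made for brevity, and removal of drained stores from the filter is not modeled.

```python
import hashlib
from typing import List, Optional, Tuple


class BloomFilter:
    """Simple Bloom filter over addresses (sizes are arbitrary assumptions)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, addr: int):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, addr: int) -> None:
        for pos in self._positions(addr):
            self.bits[pos] = True

    def may_contain(self, addr: int) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(addr))


class SecondLevelStoreQueue:
    """Program-ordered store list searched only when the filter allows."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, int]] = []  # (addr, value), oldest first
        self.filter = BloomFilter()
        self.linear_searches = 0  # counts the slow-path searches

    def push(self, addr: int, value: int) -> None:
        self.entries.append((addr, value))
        self.filter.add(addr)

    def lookup(self, addr: int) -> Optional[int]:
        if not self.filter.may_contain(addr):
            return None  # fast path: no pending store to this address
        self.linear_searches += 1
        # Slow path: scan youngest-to-oldest for the latest matching store.
        for entry_addr, value in reversed(self.entries):
            if entry_addr == addr:
                return value
        return None  # the filter reported a false positive


queue2 = SecondLevelStoreQueue()
queue2.push(0x40, 1)
queue2.push(0x40, 2)
queue2.push(0x80, 3)
```

The filter removes most searches for addresses with no pending store, but every possible hit still pays for a scan whose length grows with the number of queued stores.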
Other researchers have investigated using an L1 (level-one) data cache to hold store values before they are applied to the memory subsystem. For example, this technique is described in [Gandhi05] Gandhi, Akkary, Rajwar, Srinivasan and Lai, “Scalable Load and Store Processing in Latency Tolerant Processors,” Intl. Symposium on Computer Architecture, pp. 446-457, 2005. Unfortunately, this technique decreases the performance of the data cache, because the data cache must hold all of the buffered stores. It also requires a dedicated data cache per strand. Otherwise, if a memory model such as total store ordering (TSO) is to be supported, data cache performance degrades further, because other strands cannot see the stores until they are removed from the store queue and applied to the memory subsystem.
Hence, what is needed is an efficient and practical store queue design which can accommodate larger numbers of stores without the above-described problems.