1. Field
The described embodiments relate to the design of processors within computer systems. More specifically, the described embodiments include a processor with a store queue that provides bounded-time responses to read-after-write (RAW) bypasses and forward-progress requests for threads.
2. Related Art
Advances in semiconductor fabrication technology have given rise to dramatic increases in microprocessor clock speeds. Unfortunately, this increase in microprocessor clock speeds has not been matched by a corresponding increase in memory access speeds. Hence, the disparity between microprocessor clock speeds and memory access speeds continues to grow, and is beginning to create significant performance problems. Execution profiles for fast microprocessor systems show that a large fraction of execution time is spent not within the microprocessor core, but within memory structures outside of the microprocessor core. This means that the microprocessor systems spend a large fraction of time waiting for memory references to complete instead of performing computational operations.
Efficient caching schemes can help reduce the number of memory accesses that are performed. However, when a load generates a cache miss, the subsequent access to level-two (L2) cache or memory can require dozens or hundreds of clock cycles to complete, during which time the processor is typically idle, performing no useful work. In contrast, cache misses during stores typically do not affect processor performance as much because the processor usually places the stores into a “store queue” and continues executing subsequent instructions. Existing store queue designs typically maintain an array of pending stores in program order.
Some existing store queue designs place a “store-mark” on a cache line to indicate that one or more store queue entries include pending stores that are to be committed to the cache line, and then delay accesses to the store-marked cache lines by other threads. In some of these designs, one or more of the pending stores can be directed to data-words in the same cache line. If consecutive stores are directed to data-words in the same cache line, these stores can be merged in the store queue without violating a conventional memory model, such as the Total-Store-Order (TSO) memory model. Moreover, because accesses by other threads to a store-marked cache line are delayed, those threads cannot observe the order in which the pending stores to that cache line are applied. Hence, stores to a store-marked cache line can be reordered, thereby allowing non-consecutive stores to be merged without violating TSO. This merging can reduce memory bandwidth consumption because the number of memory accesses is reduced.
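For illustration only, the merging of consecutive stores to the same cache line can be sketched in software as follows. This is a simplified model, not a description of any particular hardware design: the 64-byte cache line, the 8-byte data-word, and the names `Entry`, `StoreQueue`, and `push` are all assumptions introduced for this sketch.

```python
from dataclasses import dataclass, field

LINE_BYTES = 64  # assumed cache-line size
WORD_BYTES = 8   # assumed data-word size


@dataclass
class Entry:
    """One store queue entry (hypothetical layout): a line plus its pending words."""
    line_addr: int
    words: dict = field(default_factory=dict)  # word index -> pending value


class StoreQueue:
    """Toy store queue that merges consecutive stores to the same cache line."""

    def __init__(self):
        self.entries = []          # oldest entry first (program order)
        self.store_marked = set()  # cache lines with pending stores

    def push(self, addr, value):
        line = addr // LINE_BYTES * LINE_BYTES
        word = (addr % LINE_BYTES) // WORD_BYTES
        self.store_marked.add(line)  # place a store-mark on the line
        youngest = self.entries[-1] if self.entries else None
        if youngest is not None and youngest.line_addr == line:
            # Consecutive store to the same line: merge into the youngest entry.
            youngest.words[word] = value
        else:
            # Different line: allocate a new entry to preserve program order.
            self.entries.append(Entry(line, {word: value}))


sq = StoreQueue()
sq.push(0x1000, 1)  # new entry for line 0x1000
sq.push(0x1008, 2)  # merged: consecutive store to line 0x1000
sq.push(0x2000, 3)  # new entry for line 0x2000
sq.push(0x1010, 4)  # not merged here: the youngest entry is for line 0x2000
```

In this sketch the final store allocates a third entry because only consecutive stores are merged. A design that relies on store-marks to merge non-consecutive stores would instead fold that store into the first entry.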
When multiple threads are store-marking cache lines, some existing processors traverse a list of stores (e.g., the store queue) in order to respond to requests for cache lines by other threads, e.g., forward-progress read-to-own (FPRTO) coherence requests and/or read-after-write (RAW) bypass requests. Traversing a list to locate data for stores can be time-consuming, and the time required grows with the number of stores in the list.
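As a rough illustration of why such a traversal is costly, the following sketch models a RAW bypass lookup as a linear scan from the youngest pending store to the oldest. The function name `raw_bypass_scan` and the representation of the store queue as a list of (address, value) pairs are assumptions made for this example only.

```python
def raw_bypass_scan(pending_stores, addr):
    """Scan pending stores, youngest first, for the latest store to addr.

    pending_stores is a list of (address, value) pairs, oldest first.
    Returns (value, steps), where value is None if no pending store matches
    and steps is the number of entries examined.
    """
    steps = 0
    for st_addr, value in reversed(pending_stores):
        steps += 1
        if st_addr == addr:
            return value, steps  # youngest matching store supplies the data
    return None, steps           # no match: the load must go to the cache


pending = [(0x100, 1), (0x108, 2), (0x100, 3), (0x110, 4)]

hit = raw_bypass_scan(pending, 0x100)   # finds the younger store to 0x100
miss = raw_bypass_scan(pending, 0x200)  # examines every entry before failing
```

In this model a miss examines every entry, so the worst-case lookup time is proportional to the number of pending stores rather than bounded by a constant.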