1. Field
The present invention generally relates to the design of processors within computer systems. More specifically, the present invention relates to a processor with a store queue, which supports store-merging and provides forward-progress guarantees for threads.
2. Related Art
Advances in semiconductor fabrication technology have given rise to dramatic increases in microprocessor clock speeds. This increase in microprocessor clock speeds has not been matched by a corresponding increase in memory access speeds. Hence, the disparity between microprocessor clock speeds and memory access speeds continues to grow and is beginning to create significant performance problems. Execution profiles for fast microprocessor systems show that a large fraction of execution time is spent not within the microprocessor core, but within memory structures outside of the microprocessor core. This means that microprocessor systems spend a large fraction of time waiting for memory references to complete instead of performing computational operations.
Efficient caching schemes can help reduce the number of memory accesses that are performed. However, when a memory reference, such as a load, generates a cache miss, the subsequent access to level-two (L2) cache or memory can require dozens or hundreds of clock cycles to complete, during which time the processor is typically idle, performing no useful work.
In contrast, cache misses during stores typically do not affect processor performance as much, because the processor usually places the stores into a “store queue” and continues executing subsequent instructions. Existing store queue designs typically maintain an array of pending stores in program order. Note that some of these pending stores can possibly be directed to the same dataword in the same cache line. In particular, if consecutive stores are directed to the same dataword, these stores can be effectively merged into a single entry in the store queue without violating a conventional memory model, such as the Total-Store-Order (TSO) memory model. This merging reduces memory bandwidth requirements, because fewer memory accesses are performed.
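The merging of consecutive stores described above can be sketched as follows. This is an illustrative software model only, not the claimed hardware; the class name, the 8-byte dataword granularity, and the byte-map representation of an entry are all assumptions made for the example.

```python
# Illustrative model of a store queue that merges *consecutive* stores
# directed to the same dataword. All names and the 8-byte dataword size
# are assumptions for this sketch, not details from the specification.

DATAWORD_SIZE = 8  # assumed dataword granularity, in bytes

class StoreQueue:
    def __init__(self):
        # Pending stores in program order: (dataword_addr, {offset: byte})
        self.entries = []

    def add_store(self, addr, value):
        word = addr // DATAWORD_SIZE
        # Merge only when the *most recent* pending entry targets the same
        # dataword; merging with an older, non-adjacent entry would reorder
        # stores relative to intervening stores and could violate TSO.
        if self.entries and self.entries[-1][0] == word:
            self.entries[-1][1][addr % DATAWORD_SIZE] = value
        else:
            self.entries.append((word, {addr % DATAWORD_SIZE: value}))

q = StoreQueue()
q.add_store(0x100, 0xAA)  # new entry for the dataword containing 0x100
q.add_store(0x101, 0xBB)  # consecutive store to the same dataword: merged
q.add_store(0x108, 0xCC)  # different dataword: separate entry
assert len(q.entries) == 2
```

Three stores occupy only two entries, which is the bandwidth saving the merging provides.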
However, when “non-consecutive” stores (that is, stores that are separated, in program order, by one or more stores by the same thread to a different dataword) directed to the same dataword are pending in a store queue, these non-consecutive stores typically cannot be merged without violating a conventional memory model, such as TSO. TSO is violated because merging non-consecutive stores effectively reorders the stores with respect to other intervening memory accesses.
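A concrete instance of this violation can be worked through as follows. Suppose a thread issues, in program order, ST A=1; ST B=1; ST A=2, where A and B lie in different datawords. Merging the second store to A into the earlier entry makes A=2 visible before B=1, so an observer can see A==2 while B is still 0, which TSO forbids. The sketch below, with illustrative names only, enumerates the snapshots an observer could see under each visibility order:

```python
# Illustrative sketch (names assumed, not from the specification) of why
# merging non-consecutive stores violates TSO. Thread 1 issues, in program
# order: ST A=1 ; ST B=1 ; ST A=2, where A and B are different datawords.
# Under TSO, stores become globally visible in program order.

def observations(visibility_order):
    """Enumerate the (A, B) snapshots an observer can see between stores."""
    mem = {"A": 0, "B": 0}
    snaps = [dict(mem)]
    for var, val in visibility_order:
        mem[var] = val
        snaps.append(dict(mem))
    return snaps

tso_order    = [("A", 1), ("B", 1), ("A", 2)]  # program (TSO) order
merged_order = [("A", 2), ("B", 1)]            # stores to A merged into
                                               # the earlier queue entry

illegal = {"A": 2, "B": 0}  # forbidden by TSO: A=2 follows B=1
assert illegal not in observations(tso_order)
assert illegal in observations(merged_order)
```

The merged order exposes a snapshot (A==2, B==0) that no TSO-conforming execution can produce, which is why the merge is disallowed.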
This problem can be mitigated by “store-marking” cache lines to indicate that one or more store queue entries are waiting to be committed to the cache lines, and then delaying accesses to the store-marked cache lines by other threads. In this way, stores to a given cache line can be reordered, thereby allowing non-consecutive stores to be merged without violating TSO.
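The store-marking mechanism just described can be sketched in software as follows. This is a simplified model under stated assumptions: the class and method names are illustrative, and a real processor would stall or replay the delayed access rather than return a failure code.

```python
# Hypothetical sketch of store-marking. A thread places a store-mark on a
# cache line before committing merged stores to it; accesses to a marked
# line by other threads are delayed (modeled here as a refused attempt)
# until the mark is released. All names are assumptions for this example.

class Cache:
    def __init__(self):
        self.marks = {}  # cache-line address -> id of marking thread

    def try_store_mark(self, line, tid):
        owner = self.marks.get(line)
        if owner is None or owner == tid:
            self.marks[line] = tid
            return True
        return False     # another thread holds the mark: access is delayed

    def release_mark(self, line, tid):
        if self.marks.get(line) == tid:
            del self.marks[line]

cache = Cache()
assert cache.try_store_mark(0x40, tid=1)      # thread 1 marks line 0x40
assert not cache.try_store_mark(0x40, tid=2)  # thread 2 must wait
cache.release_mark(0x40, tid=1)
assert cache.try_store_mark(0x40, tid=2)      # thread 2 can now proceed
```

While thread 1 holds the mark, its stores to line 0x40 can be freely reordered and merged, since no other thread can observe the line's intermediate states. The final assertion also illustrates the forward-progress hazard of the next paragraph: a thread can only proceed once the mark it needs becomes available.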
However, when multiple threads are store-marking cache lines, it is hard to ensure that a given thread makes forward progress, because it cannot be guaranteed that the given thread will successfully acquire a store-mark on a needed cache line.
Hence, what is needed is a method and an apparatus for ensuring forward progress for threads in a system which supports store-merging in a store queue.