1. Field of the Invention
The present invention relates to the design of processors within computer systems. More specifically, the present invention relates to a technique that facilitates reordering store instructions through cacheline marking.
2. Related Art
Advances in semiconductor fabrication technology have given rise to dramatic increases in microprocessor clock speeds. This increase in microprocessor clock speeds has not been matched by a corresponding increase in memory access speeds. Hence, the disparity between microprocessor clock speeds and memory access speeds continues to grow, and is beginning to create significant performance problems. Execution profiles for fast microprocessor systems show that a large fraction of execution time is spent not within the microprocessor core, but within memory structures outside of the microprocessor core. This means that the microprocessor systems spend a large fraction of time waiting for memory references to complete instead of performing computational operations.
Efficient caching schemes can help reduce the number of memory accesses that are performed. However, when a memory reference, such as a load, generates a cache miss, the subsequent access to level-two (L2) cache or memory can require dozens or hundreds of clock cycles to complete, during which time the processor is typically idle, performing no useful work.
In contrast, cache misses during stores typically do not affect processor performance as much because the processor usually places the stores into a “store queue” and continues executing subsequent instructions. Existing store queue designs typically maintain an array of pending stores in program order. Note that some of these pending stores may be directed to the same word in the same cacheline. In particular, if consecutive stores are directed to the same word, these stores can be merged into a single entry in the store queue without violating a conventional memory model, such as the Total-Store-Order (TSO) memory model. This merging reduces memory bandwidth consumption because fewer memory accesses are performed.
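For purposes of illustration only, the merging of consecutive stores described above can be sketched as follows. This is a minimal software model under stated assumptions, not a description of any particular hardware design: the class name StoreQueue, its methods push and drain, and the use of a dictionary to stand in for memory are all hypothetical.

```python
from collections import deque


class StoreQueue:
    """Hypothetical model of a program-order store queue.

    Only consecutive stores to the same word are merged, so the
    relative order of stores to different words is never changed.
    """

    def __init__(self):
        # Each entry is [word_address, value], oldest entry first.
        self.entries = deque()

    def push(self, word_address, value):
        # Merge only when the youngest pending store targets the same word.
        if self.entries and self.entries[-1][0] == word_address:
            self.entries[-1][1] = value
        else:
            self.entries.append([word_address, value])

    def drain(self, memory):
        # Commit pending stores in program order; return the number of
        # memory writes that were actually performed.
        writes = 0
        while self.entries:
            word_address, value = self.entries.popleft()
            memory[word_address] = value
            writes += 1
        return writes


sq = StoreQueue()
sq.push(0x100, 1)
sq.push(0x100, 2)  # consecutive store to the same word: merged
sq.push(0x108, 3)
sq.push(0x100, 4)  # non-consecutive store to 0x100: kept as a separate entry
memory = {}
writes = sq.drain(memory)
```

In this sketch, four stores result in three memory writes: the first two are merged, while the final store to 0x100 occupies its own entry because a store to a different word intervenes.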
However, when “non-consecutive” stores (that is, stores that are separated, in program order, by one or more stores by the same thread to a different word) directed to a same word are pending in a store queue, these non-consecutive stores to the same word typically cannot be merged without violating a conventional memory model, such as TSO. TSO is violated because merging non-consecutive stores effectively reorders the stores with respect to other intervening memory accesses.
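The following sketch illustrates, under simplifying assumptions, why such merging is observable. It models memory as a dictionary and records the state that another processor could observe after each committed store. The function names visible_states and merge_into_oldest, and the locations "A" and "B", are hypothetical and are used only for this example.

```python
def visible_states(stores, initial):
    """Return the memory state observable after each store commits."""
    memory = dict(initial)
    states = []
    for address, value in stores:
        memory[address] = value
        states.append(dict(memory))
    return states


def merge_into_oldest(stores):
    """Merge every later store to a word into the oldest pending entry
    for that word, even when other stores intervene (not TSO-safe)."""
    merged = []
    position = {}
    for address, value in stores:
        if address in position:
            merged[position[address]] = (address, value)
        else:
            position[address] = len(merged)
            merged.append((address, value))
    return merged


# Program order: store A=1, store B=1, store A=2.
program_order = [("A", 1), ("B", 1), ("A", 2)]
initial = {"A": 0, "B": 0}

in_order = visible_states(program_order, initial)
after_merge = visible_states(merge_into_oldest(program_order), initial)

# States exposed by the merged queue that program order never produces.
illegal = [state for state in after_merge if state not in in_order]
```

Here the merged queue exposes a state in which A is already 2 while B is still 0. Under program order, the store of 2 to A follows the store to B, so an observer of the in-order sequence can never see that state. The merge has therefore reordered the stores in a way that TSO does not permit.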
This inability to reorder stores also gives rise to other performance problems. For example, non-consecutive stores to different words in the same cacheline cannot be reordered, and hence cannot be combined into a single write to reduce traffic to memory.
Furthermore, the inability to reorder stores may force the store queue to maintain ordering information between all stores that it contains, thus complicating its design.
Hence, what is needed is a method and apparatus that facilitates reordering stores to overcome the above-described problems.