1. Field of the Invention
The invention relates to the field of computer architecture. More specifically, the invention relates to accelerating store operations in a shared memory multiprocessor system.
2. Description of the Related Art
Commercial applications such as online transaction processing (OLTP), web servers, and application servers represent a crucial market segment for shared-memory multiprocessor servers. Numerous studies have shown that many of these applications are characterized by large instruction and data footprints, which cause high cache miss rates that result in high Cycles Per Instruction (CPI). While there has been substantial research to improve their instruction miss rates and to mitigate the performance impact of their high load miss rates, there has been little research into the performance impact of stores in commercial applications. Stores constitute a very significant percentage of the dynamic instruction count of commercial applications, and store miss rates are comparable to, if not higher than, load miss rates and instruction miss rates.
Off-chip store misses (i.e., store accesses that miss the on-chip caches) require long latency accesses to either main memory or an off-chip cache and are particularly expensive. The performance impact of off-chip store misses depends on the degree of their overlap with computation and with other off-chip misses.
The majority of off-chip store misses cannot be overlapped with computation. Store handling optimizations that increase the number of overlapping store misses handled in parallel (hereinafter referred to as store memory-level parallelism), such as store prefetching, have been shown to be critical in mitigating the performance impact of these misses. While store prefetching has been demonstrated to be effective, it consumes substantial L2 cache bandwidth, which will be a precious resource in future aggressive chip multi-processors. Hence, even with store prefetching, the performance impact of off-chip store misses is not fully mitigated.
Another technique to mitigate the performance impact of store misses is to increase the size of the store queue and the store buffer. When the store queue is full, the processor must stop retirement as soon as the next instruction to retire is a store. At that point, the reorder buffer and the store buffer can no longer drain, so they begin to fill up. When the reorder buffer is full, or when the processor tries to dispatch a store and the store buffer is full, the processor can no longer dispatch any more instructions. Eventually, the processor pipeline stalls and remains stalled until the store queue drains. Thus, a missing store at the head of the store queue can stall the processor for several hundred cycles. However, there are limits to increasing the store buffer and store queue sizes. If an intra-processor data dependence exists between an uncommitted store and a subsequent load, the memory contains a stale value. The processor must detect this dependence and either deliver the value from the store buffer/queue or stall the load until the value is committed. Thus, every load must associatively search the store buffer/queue and ascertain that there are no prior stores with matching addresses. In order to detect store-load dependences, the store buffer and the store queue are traditionally implemented using a Content Addressable Memory (CAM) structure. The CAM nature of these two structures places a limit on how much they can be enlarged before they impact the processor's clock frequency target.
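The store-load dependence check described above can be illustrated with a minimal software sketch. It is only a toy model under stated assumptions: the class name StoreQueue, its methods, and the use of a dictionary as memory are hypothetical and do not describe any particular processor. The linear search in load() stands in for the parallel CAM match that hardware performs on every entry.

```python
from collections import deque


class StoreQueue:
    """Toy model (hypothetical) of a store buffer/queue with associative lookup."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # oldest store on the left, youngest on the right

    def is_full(self):
        return len(self.entries) >= self.capacity

    def push_store(self, addr, value):
        """Buffer an uncommitted store; return False if the queue is full (stall)."""
        if self.is_full():
            return False
        self.entries.append((addr, value))
        return True

    def commit_oldest(self, memory):
        """Write the oldest buffered store to memory, preserving program order."""
        if self.entries:
            addr, value = self.entries.popleft()
            memory[addr] = value

    def load(self, addr, memory):
        """Search the entries youngest first; forward on a match, else read memory."""
        for st_addr, st_value in reversed(self.entries):
            if st_addr == addr:
                return st_value  # memory still holds a stale value for this address
        return memory.get(addr, 0)
```

In this sketch the cost of load() grows with the number of buffered stores, which loosely mirrors why enlarging a CAM structure eventually affects cycle time.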
The performance of stores is also impacted by the memory consistency model implemented by the processor. Previous detailed studies on memory consistency models were performed using scientific workloads rather than commercial workloads, and most focused on the performance differences between sequential consistency and release consistency. However, none of the four remaining server processor instruction set architectures implements the sequential consistency model. They implement variations of processor consistency (INTEL/AMD X86 and X64, SUN SPARC TSO), weak consistency (IBM POWERPC, SUN SPARC RMO), or release consistency (INTEL IA-64). These memory consistency models place ordering constraints that greatly affect the impact of store misses. In particular, the processor consistency models used by the highest-volume server processors require an in-order commit of stores. A straightforward implementation that satisfies this requirement holds up all later stores while a store miss is handled.
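The cost of holding up later stores behind a store miss can be sketched with a small idealized calculation. The latencies used here (300 cycles for a miss, 1 cycle for a hit) and the function names are assumptions chosen for illustration, not measured values. The second function models the idealized case in which all misses are started at cycle 0 and overlap fully, while stores still commit in program order.

```python
def serialized_commit_cycles(latencies):
    """Each store starts only after the previous one commits (no overlap)."""
    return sum(latencies)


def overlapped_commit_cycles(latencies):
    """All accesses start at cycle 0; stores still commit in program order.

    A store commits once its own access is done and every older store
    has committed, so the commit time is a running maximum.
    """
    commit_time = 0
    for latency in latencies:
        commit_time = max(commit_time, latency)
    return commit_time


# Hypothetical store stream: miss, hit, miss, hit.
MISS, HIT = 300, 1
stream = [MISS, HIT, MISS, HIT]
print(serialized_commit_cycles(stream))   # 602
print(overlapped_commit_cycles(stream))   # 300
```

The gap between the two results is the benefit that store memory-level parallelism could provide in this simplified model.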
In addition to the challenges of enlarging CAM-implemented store buffers/queues, increasing store buffer and store queue sizes is less effective in improving memory-level parallelism than previously assumed, especially in commercial applications and with commercial workloads. Serializing instructions (e.g., casa and membar in the SPARC ISA) remain major impediments. Most of these serializing instructions occur in the lock acquire and lock release of critical sections. On encountering a serializing instruction, the store buffer/queue has to be drained of all prior store misses in the processor consistency model. While increasing store buffer/queue sizes may be effective in alleviating the impact of bursty stores and store misses, it does not address the impact of store misses followed by serializing instructions.
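The limited benefit of larger store buffers/queues in the presence of serializing instructions can be sketched with a simple idealized model. Everything in it is an assumption made for illustration: the function name drain_cycles, the default latencies, and the simplification that buffered stores overlap fully and are drained as a group either when the queue fills or when a serializing instruction is reached.

```python
def drain_cycles(stream, queue_size, miss_latency=300, hit_latency=1):
    """Estimate cycles spent draining stores in a toy model.

    stream is a list of "hit", "miss", or "serialize". Stores buffered
    together overlap, so a group costs the latency of its slowest store.
    A serializing instruction forces the current group to drain.
    """
    total = 0
    window = []
    for op in stream:
        if op == "serialize":
            total += max(window, default=0)
            window = []
            continue
        window.append(miss_latency if op == "miss" else hit_latency)
        if len(window) == queue_size:
            total += max(window)
            window = []
    total += max(window, default=0)
    return total


# With a serializing instruction after every miss, queue size does not help.
locked = ["miss", "serialize"] * 4
print(drain_cycles(locked, queue_size=2))    # 1200
print(drain_cycles(locked, queue_size=64))   # 1200

# Without serializing instructions, a larger queue overlaps the misses.
bursty = ["miss"] * 4
print(drain_cycles(bursty, queue_size=1))    # 1200
print(drain_cycles(bursty, queue_size=4))    # 300
```

In this model, enlarging the queue helps the bursty stream but leaves the lock-heavy stream unchanged, which matches the qualitative point made above.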
Effective mechanisms are therefore desirable to mitigate the large performance impact of store misses in current processor architectures and systems without using excessive bandwidth for hardware store prefetching and without implementing very large store buffers/queues that impose cycle time constraints.