1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficient handling of store misses corresponding to multiple threads.
2. Description of the Relevant Art
Modern microprocessors typically buffer retired store instructions that have yet to write data to a memory subsystem. A store queue (SQ) is a hardware structure configured to buffer retired store instructions, or memory write operations. A particular store instruction is generally held in this structure from the point in time the store instruction is retired in a pipeline to the point in time it is known that the store instruction has been processed by the memory subsystem such that the corresponding data of the store instruction is globally visible to all processors and threads within the system.
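The buffering behavior described above can be illustrated with a minimal sketch. The class and method names below are hypothetical and chosen for illustration only; real store queues are hardware structures, not software objects.

```python
from collections import deque

class StoreQueue:
    """Toy model of a store queue (SQ): retired stores are buffered,
    in program order, until the memory subsystem has processed them
    and their data is globally visible. Illustrative only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # FIFO order models program order

    def retire_store(self, addr, data):
        # A retired store enters the SQ; a full SQ stalls the pipeline.
        if len(self.entries) == self.capacity:
            raise RuntimeError("SQ full: pipeline must stall")
        self.entries.append({"addr": addr, "data": data})

    def commit_oldest(self, memory):
        # The oldest store drains once the memory subsystem accepts it,
        # at which point its data becomes globally visible.
        entry = self.entries.popleft()
        memory[entry["addr"]] = entry["data"]

sq = StoreQueue(capacity=2)
mem = {}
sq.retire_store(0x40, 7)
sq.retire_store(0x80, 9)
sq.commit_oldest(mem)   # drains the oldest store (to address 0x40)
```

The FIFO discipline mirrors the requirement that stores leave the queue only after they are known to be globally visible.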
Modern multi-core processors are typically coupled to a memory hierarchy including on-chip caches. Each core typically contains a relatively small level 1 (L1) cache in order to keep access latencies as low as possible. Multiple cores may share one or more larger level 2 or level 3 (L2 or L3) caches. Since a die will contain multiple copies of an L1 cache, it is desirable to make the L1 cache as small and simple as possible. One common method for removing complexity from the L1 cache is to make it a write-through cache and to defer the responsibility of managing memory request ordering and cache coherency to the higher-level (L2 or L3) caches.
A multi-threaded shared memory system may be thought of as a single atomic memory on which multiple apparently sequential threads are operating. An apparently sequential thread is a single thread that, viewed in isolation, behaves as if it is running sequentially. This type of execution implies a few constraints, such as that a store instruction cannot be reordered with respect to another load or store instruction corresponding to the same memory location; otherwise, the illusion of sequential execution is removed. Similarly, the dependencies between branches and subsequent stores need to be respected. For example, if branch prediction occurs, it must not have an observable effect.
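The same-address constraint above can be made concrete with a small sketch: if a store to a location is reordered past a later load of that same location, the load observes a different value, so the reordering is visible and therefore illegal. All names here are illustrative.

```python
# Each operation is ("store", addr, data) or ("load", addr).
def run(ops):
    """Execute a single thread's operations in the given order and
    return the values observed by its loads. Illustrative model only."""
    mem = {0x10: 0}
    observed = []
    for op in ops:
        if op[0] == "store":
            mem[op[1]] = op[2]
        else:
            observed.append(mem[op[1]])
    return observed

program   = [("store", 0x10, 1), ("load", 0x10)]   # program order
reordered = [("load", 0x10), ("store", 0x10, 1)]   # illegal reordering

assert run(program) == [1]     # sequential result
assert run(reordered) == [0]   # the reordering is observable
```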
In practice, a multi-threaded shared memory system is very complex, with a hierarchy of buffers, caches, random-access-memory (RAM), and disks. Memory consistency models attempt to describe and constrain the behavior of these complex systems. Actions on this complex memory system are serializable; that is, there is a single serial history of all load and store memory instructions. This single serial history is consistent with the execution behavior of each thread, which accounts for the observed behavior of the program. One example of a memory model is Sequential Consistency; see L. Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs, IEEE Transactions on Computers, September 1979, pp. 690-691. With Sequential Consistency, sequential behavior is enforced by requiring that serializations respect the exact order in which operations occurred in the program.
In contrast to Sequential Consistency (SC), more relaxed memory models utilize different rules for instruction reordering, wherein the instructions within a thread may be partially ordered rather than totally ordered as in SC. Examples of more relaxed memory models include the memory models of the PowerPC architecture, see C. May, et al, The PowerPC Architecture: A Specification for A New Family of RISC Processors, Morgan Kaufmann, 1994, and the RMO model for the SPARC architecture, see D. L. Weaver, et al, The SPARC Architecture Manual (Version 9), Prentice-Hall, 1994.
Most memory models have a store atomicity property, and it is this property that is enforced by cache coherence protocols. Store atomicity describes inter-thread communication via memory and describes the ordering constraints that must exist in serializable models. An example of a memory model that does not obey store atomicity is the Total Store Order (TSO) memory model of the SPARC Architecture. The only reordering permitted in TSO is that a later load instruction may bypass an earlier store instruction. Local load operations are permitted to obtain values from a store operation before that store has been committed to memory. The TSO memory model specifies that all stores be performed in a strict order, so that within one thread, all stores are performed in program order, and in a symmetric multi-processing (SMP) arrangement, all threads must observe a consistent ordering of stores from other threads.
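The TSO behavior described above, where a local load may take its value from the core's own store buffer before the store is globally visible, can be sketched as follows. The class and method names are hypothetical, chosen only to illustrate the forwarding and in-order drain behavior.

```python
class TSOCore:
    """Toy model of one core under TSO: stores enter a store buffer in
    program order and drain to memory strictly in that order, while a
    local load may be satisfied by the core's own buffered store
    (store-to-load forwarding). Illustrative only."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = []  # (addr, data), program order

    def store(self, addr, data):
        self.store_buffer.append((addr, data))  # not yet globally visible

    def load(self, addr):
        # The youngest matching buffered store is forwarded, if any.
        for a, d in reversed(self.store_buffer):
            if a == addr:
                return d
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Stores leave the buffer strictly in program order.
        addr, data = self.store_buffer.pop(0)
        self.memory[addr] = data

mem = {0x100: 0}
core = TSOCore(mem)
core.store(0x100, 42)
assert core.load(0x100) == 42   # local load sees its own store early
assert mem[0x100] == 0          # other threads do not, yet
core.drain_one()                # store becomes globally visible
```

The early forwarding is precisely why TSO, as described above, does not obey strict store atomicity: the storing thread observes the value before any other thread can.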
Regardless of the implemented memory model that provides rules that specify in what order memory operations may be performed relative to program order and relative to other memory operations, there needs to be a point within the memory hierarchy that serves as a reference for all store operations. This point is referred to as a global ordering point. The global ordering point is responsible for ensuring that all consumers will see a consistent and proper ordering of store operations. This is typically accomplished by requiring that a cache line be in an exclusive state of a cache coherency protocol, such as MESI, before executing a store operation.
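The requirement above, that a line be held in a writable coherence state before a store may be performed, can be sketched with a minimal MESI check. The function names and the `acquire_ownership` callback are illustrative assumptions, not part of any particular protocol implementation.

```python
# In MESI, only the Modified (M) and Exclusive (E) states permit a write.
MESI_WRITABLE = {"M", "E"}

def try_store(line_state, acquire_ownership):
    """Perform a store on a line in the given MESI state. If the line is
    Shared (S) or Invalid (I), a coherence transaction (modeled by the
    acquire_ownership callback) must first obtain a writable copy.
    Illustrative sketch only."""
    if line_state not in MESI_WRITABLE:
        # E.g. invalidate other sharers and/or fetch the line; this is
        # the potentially long-latency step discussed in the text.
        line_state = acquire_ownership()
    assert line_state in MESI_WRITABLE
    return "M"  # after the write, the line is Modified

assert try_store("E", acquire_ownership=lambda: "E") == "M"  # fast path
assert try_store("S", acquire_ownership=lambda: "E") == "M"  # slow path
```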
A problem arises when a cache line being accessed is not in an exclusive state and, therefore, needs to be acquired. Line acquisition may take a very long time, potentially requiring system-level coherence operations and/or fetching data from relatively slow dynamic-random-access-memory (DRAM). In the meantime, since store instructions may need to be executed in program order, any following store instructions are queued while they wait for the previous store instruction to complete. Two problems may occur. First, if a later store instruction also misses, whether because the cache line is actually missing from the cache or because the store instruction accesses a cache line not in an exclusive state, then the long latencies become serialized, which leads to very long execution times.
Second, queuing many later store instructions, each of which includes respective address, data, and control and status information, can be expensive. On-chip real estate is consumed, and access times increase with the size of such a queue. For example, searches in the queue may be implemented by circuitry for a content-addressable-memory (CAM) comparison of the address and valid status information of all entries within the queue. Circuitry for CAM match comparisons typically utilizes dynamic logic that consumes a relatively high amount of power. The access time of an array utilizing CAM comparison circuitry may be a factor in determining a processor's clock cycle duration. Therefore, the size of the queue has an upper limit based on both timing requirements and power consumption. These constraints are experienced especially in the case where a cache, such as an L2 cache, is shared by multiple cores and/or threads. These queues need to have a finite and manageable size.
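The CAM search described above compares the search address against the address and valid bit of every entry simultaneously. A minimal functional sketch follows; the list comprehension stands in for the parallel comparators, and the entry layout is an illustrative assumption.

```python
def cam_match(entries, key_addr):
    """Return the indices of all valid entries whose address matches the
    search key. In hardware, every comparison occurs in parallel, which
    is why cost and power grow with the number of entries. Sketch only."""
    return [i for i, e in enumerate(entries)
            if e["valid"] and e["addr"] == key_addr]

queue = [
    {"valid": True,  "addr": 0x40},
    {"valid": False, "addr": 0x80},   # invalid entries never match
    {"valid": True,  "addr": 0x80},
]
assert cam_match(queue, 0x80) == [2]
```

Because every entry contributes a comparator, the power and timing cost scales with queue depth, which motivates the upper limit on queue size noted above.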
However, filling these queues results in the shared cache rejecting new requests. Data collected from various benchmarks show that store instruction misses, as described above, are relatively uncommon. Therefore, additional store instruction misses occurring while a first one is pending are rarer still. However, such a situation may still occur, and a single occurrence, complete with very long serialized latencies, may cause the store instruction miss queues, or miss buffers, to fill up. Filled miss buffers cause the processor core's store queue to also fill up. This event causes processor execution to halt, which needs to be avoided.
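The backpressure chain above, where a full miss buffer backs up the store queue and ultimately stalls the core, can be sketched with a simple simulation. The function, its parameters, and the sizes are all illustrative assumptions, and no draining is modeled, which is the pathological case the text describes.

```python
def simulate(stores, miss_buffer_cap, store_queue_cap):
    """Count stalls when every store misses and nothing drains:
    the miss buffer fills, the store queue backs up behind it, and
    further stores cannot retire. Illustrative sketch only."""
    miss_buffer, store_queue = [], []
    stalled = 0
    for addr in stores:
        if len(store_queue) == store_queue_cap:
            stalled += 1  # core cannot retire further stores
            continue
        store_queue.append(addr)
        if len(miss_buffer) < miss_buffer_cap:
            # Oldest queued miss is accepted by the miss buffer.
            miss_buffer.append(store_queue.pop(0))
        # else: the store stays queued, awaiting a miss-buffer entry

    return stalled

# With a tiny miss buffer and store queue, stalls appear quickly.
assert simulate(stores=[0x0, 0x40, 0x80, 0xC0],
                miss_buffer_cap=1, store_queue_cap=2) > 0
```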
In view of the above, efficient methods and mechanisms for handling store misses corresponding to multiple threads are desired.