The present invention relates in general to the caching of data in multiprocessor systems and, more particularly, to simplified cache coherence protocols for shared memory multi-core processing systems utilizing self-invalidation and multiple write policies to maintain coherence in a private/shared cache hierarchy.
In a multiple processor environment, two or more microprocessors (referred to as multiple core, multi-core and many-core) reside on the same chip and commonly share access to the same area of main memory via a cache hierarchy. Shared-memory multiprocessors simplify parallel programming by providing a single address space even when memory is physically distributed across many processing nodes or cores. Most shared-memory multiprocessors use cache memories or “caches” to facilitate access to shared data, and to reduce the latency of a processor's access to memory. Small but fast individual caches are associated with each processor core to speed up access to a main memory. Caches, and the protocols controlling the data access to caches, are of highest importance in the multi-core parallel programming model.
To satisfy coherence definitions, coherence protocols react immediately to writes and invalidate all cached read copies. Shared-memory systems typically implement coherence with snooping or directory-based protocols. Directory-based cache coherence protocols are notoriously complex, requiring a directory to constantly track readers and writers and to send invalidations or global broadcasts or snoops. Directory protocols also require additional performance and transient states to cover every possible race that may arise. For example, the GEMS [23] implementation of the MESI directory protocol, a direct descendant of the SUNfire coherence protocol, requires no fewer than 30 states. Verification of such protocols is difficult and in many cases incomplete [1].
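By way of illustration only, the reader tracking and write-triggered invalidation described above can be sketched as follows. The class name `SimpleDirectory` and its methods are hypothetical and do not correspond to any protocol cited herein. The sketch deliberately omits the transient states and race handling that make real directory protocols complex; it shows only the bookkeeping and the invalidation messages that a write generates.

```python
# Illustrative sketch only. SimpleDirectory is a hypothetical name, and the
# model ignores transient states, races, and message latency.
class SimpleDirectory:
    def __init__(self):
        self.readers = {}       # block address -> set of cores holding a read copy
        self.invalidations = 0  # total invalidation messages sent so far

    def read(self, core, block):
        # The directory must register every reader of every block.
        self.readers.setdefault(block, set()).add(core)

    def write(self, core, block):
        # On a write, every other cached read copy is invalidated immediately.
        others = self.readers.get(block, set()) - {core}
        self.invalidations += len(others)
        self.readers[block] = {core}
        return sorted(others)   # cores that received an invalidation


def demo_directory():
    d = SimpleDirectory()
    for core in (0, 1, 2):
        d.read(core, 0x40)
    invalidated = d.write(0, 0x40)  # core 0 writes the block
    return invalidated, d.invalidations
```

In this sketch, a single write by core 0 to a block read by three cores sends invalidations to cores 1 and 2.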
Significant complexity in current protocols comes from strategies for efficient execution of sequential applications. Complexity in cache coherence protocols also translates into cost. Storage is needed for cache-line state, the directory (or dual-ported/duplicate tags for snooping), and the logic required by complex cache and directory controllers. Significant effort has been expended to reduce these costs, especially the storage cost, but also the verification cost.
In terms of performance and power, complex protocols are characterized by a large number of broadcasts and snoops. Here too, significant effort has been expended to reduce or filter coherence traffic with the intent of making complex protocols more power or performance efficient. In particular, in the many-core cases, a simple and efficient implementation of coherence is of great importance to match the simplicity of the many thin cores. Furthermore, some many-core programming models (e.g., CUDA [21] and CELL [16]) exercise explicit control of the ownership of shared data. In the following description, the term “multi-core” will be used to refer to both multi-cores and many-cores, as the systems and methods described herein have application in all multiple processor core systems. The coherence schemes commonly utilized in the current processing environment have been developed for multi-chip SMPs or distributed shared memory machines where the trade-offs are markedly different from a multi-core cache hierarchy.
Recent research has realized the importance of classifying private and shared data. Some of this research has focused on using hardware for classifying private vs. shared data. Other research has focused on using the operating system or the compiler to perform the classification. The advantage of hardware mechanisms is that they can work at a granularity of a cache line. However, these mechanisms can also have prohibitive storage requirements. Techniques which employ the operating system do not require extra hardware, as the data classification can be stored along with the page table entries (PTEs) at a page granularity. However, if a single block in a page is shared (or even if two different private blocks within the same page are accessed by different cores) the whole page must be considered as shared, thus leading to misclassified blocks. Finally, the disadvantage of the compiler-assisted classification is that it is difficult to know at compile time if a variable is going to be shared or not.
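The misclassification that arises at page granularity can be illustrated with a minimal sketch. The page size, block size, and the name `PageClassifier` are assumptions chosen for illustration and are not taken from any of the cited techniques. A page is treated as private to the first core that touches it and becomes shared as soon as a different core accesses any block within it.

```python
# Illustrative sketch only. PAGE_SIZE, BLOCK_SIZE and PageClassifier are
# hypothetical; real systems keep this state alongside page table entries.
PAGE_SIZE = 4096   # bytes per page (assumed)
BLOCK_SIZE = 64    # bytes per cache block (assumed)


class PageClassifier:
    def __init__(self):
        self.owner = {}      # page number -> first core to access the page
        self.shared = set()  # page numbers classified as shared

    def access(self, core, addr):
        page = addr // PAGE_SIZE
        first = self.owner.setdefault(page, core)
        if first != core:
            # A second core touched the page: the whole page becomes shared.
            self.shared.add(page)

    def is_shared(self, addr):
        return addr // PAGE_SIZE in self.shared


def demo_classifier():
    pc = PageClassifier()
    pc.access(core=0, addr=0)           # block 0 of page 0, used only by core 0
    pc.access(core=1, addr=BLOCK_SIZE)  # block 1 of page 0, used only by core 1
    # Neither block is accessed by both cores, yet both are reported as shared.
    return pc.is_shared(0), pc.is_shared(BLOCK_SIZE), pc.is_shared(PAGE_SIZE)
```

Here two blocks that are each private to a single core are both classified as shared because they reside in the same page, while an untouched neighboring page remains unclassified as shared.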
Different proposals have used the private versus shared data classification to reach different goals. Some have utilized the classification to perform an efficient mapping for NUCA caches [14, 20]. Others have used the classification to reduce the number of broadcasts required by a snooping protocol [18], or to reduce the size of the directory cache in a directory-based protocol [11, 12]. Finally, others use the classification for choosing among different behaviors for the coherence protocol [28, 15].
Dynamic self-invalidation and tear-off copies were first proposed by Lebeck and Wood as a way to reduce invalidations in cc-NUMA [19]. The basic idea is that cache blocks can be torn off the directory (i.e., not registered there as cached copies) as long as they are discarded voluntarily before the next synchronization point by the processor that created them. As noted in their paper, this can only be supported in a weak consistency memory model (for sequential consistency (SC), self-invalidation needs to be semantically equivalent to a cache replacement). Lebeck and Wood proposed self-invalidation and tear-off copies as an optimization on top of an existing cc-NUMA protocol. Furthermore, they made an effort to restrict its use only to certain blocks through a complex classification performed at the directory. Their design choices reflect the tradeoffs of a cc-NUMA architecture: that self-invalidation should not be applied indiscriminately because misses to the directory are expensive.
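The basic idea of tear-off copies and self-invalidation can be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism of Lebeck and Wood: the names `Directory`, `L1Cache`, `load`, and `sync` are hypothetical, the caller decides which blocks are tear-off copies, and writes and the classification logic at the directory are omitted.

```python
# Illustrative sketch only. All names are hypothetical, and the choice of
# which blocks become tear-off copies is left to the caller.
class Directory:
    def __init__(self):
        self.sharers = {}  # block -> set of cores registered as holding a copy

    def read(self, core, block, tear_off):
        if not tear_off:
            self.sharers.setdefault(block, set()).add(core)
        # Tear-off copies are handed out without being registered.


class L1Cache:
    def __init__(self, core, directory):
        self.core = core
        self.directory = directory
        self.lines = {}  # block -> True if the copy is a tear-off copy

    def load(self, block, tear_off=False):
        if block not in self.lines:
            self.directory.read(self.core, block, tear_off)
            self.lines[block] = tear_off

    def sync(self):
        # At a synchronization point, voluntarily discard all tear-off copies.
        self.lines = {b: t for b, t in self.lines.items() if not t}


def demo_self_invalidation():
    d = Directory()
    c = L1Cache(core=0, directory=d)
    c.load("A", tear_off=True)  # not tracked by the directory
    c.load("B")                 # registered as a normal cached copy
    before = sorted(c.lines)
    c.sync()
    return before, sorted(c.lines), sorted(d.sharers)
```

Because the tear-off copy of block A is never registered, the directory has no invalidation to send for it; the copy is removed by the cache itself at the synchronization point.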
Self-invalidation was recently used by Kaxiras and Keramidas in their “SARC Coherence” proposal [17]. In their proposal, the authors observe that with self-invalidation, writer prediction becomes straightforward to implement. The underlying directory protocol is always active to guarantee correctness. Despite the advantage for writer prediction, however, their proposal increases the complexity of the base directory protocol with another optimization layer and leaves the directory untouched.
Finally, Choi et al. use self-invalidation instructions, inserted by the compiler after annotations in the source program, in their application-driven approach [10]. Based on the properties of disciplined parallelism, Choi et al. simplify coherence. However, their approach relies on significant feedback from the application, which must define memory regions of certain read/write behavior, and then convey and represent such regions in hardware. This requires programmer involvement at the application layer (to define the regions), compiler involvement to insert the proper self-invalidation instructions, an API to communicate all this information to the hardware, and additional hardware near the L1 to store this information. The DeNovo approach described by Choi et al. self-invalidates the “touched” data in a phase. The DeNovo approach still implements a directory (“registry”) that tracks the writers (but not the readers), and implements the directory in the data array (since shared cache data are stale in the presence of a writer). Although the directory storage cost is hidden, there is still directory functionality in the shared cache.
Consequently, a significant need exists for an improved method of maintaining cache coherence within a multi-core architecture to simplify the verification process and thereby reduce the cost and complexity throughout a shared memory processing environment without sacrificing power and performance. Additionally, a significant need exists for a simplified method of maintaining cache coherence which eliminates the need for directories, invalidations, broadcasts and snoops while maintaining or improving performance. Existing prior art cache systems and protocols need improvements to fully take advantage of the multi-core architecture. In particular, the number of unnecessary operations needs to be significantly reduced.