In many multiprocessor systems, memory devices are organized in hierarchies including main memory and one or more levels of cache memory. Data can reside in one or more of the cache levels and/or main memory. Cache coherence protocols are used in multiprocessor systems to address the potential situation where not all of the processors see the same data value for a given memory location.
Recently, architectures have been introduced where processors (or cores), and their respective cache memory devices, are grouped together into clusters. This can reduce network congestion by localizing traffic among several hierarchical levels, potentially enabling much higher scalability.
Memory systems are said to be coherent if they see memory accesses to a single data location in order. This means that if a write access is performed to data location X, and then a read access is performed to the same data location X, the memory hierarchy should return the value most recently written to X, regardless of which processors perform the read and the write and how many copies of X are present in the memory hierarchy. Likewise, coherency also typically requires that writes be performed in a serialized manner such that each processor sees those write accesses in the same order.
There are various types of cache coherency protocols and mechanisms. For example, “explicit invalidation” refers to one mechanism used by cache coherence protocols in which, when a processor writes to a particular data location in a cache, all of the other caches which contain a copy of that data are flagged as invalid by sending explicit invalidation messages. An alternative mechanism is updating, in which, when a processor writes to a particular data location in a cache, all of the other caches which contain a copy of that data are updated with the new value. Both of these cache coherence mechanisms thus require a significant amount of signaling, which scales with the number of cores (or threads) which are operating in a given data processing system. These various cache protocols and mechanisms are known to have their own strengths and weaknesses, and research continues into improving cache coherency protocols with an eye toward maintaining (or improving) performance while reducing costs (e.g., energy consumption) associated with coherency traffic.
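The difference between the two mechanisms can be illustrated with a minimal sketch. It assumes a toy model in which each private cache is a Python dictionary and a copy flagged invalid is simply removed; the function names `write_invalidate` and `write_update` and the message counting are illustrative assumptions, not part of any particular protocol.

```python
def write_invalidate(caches, writer, addr, value):
    """Write `value` to `addr` in the writer's cache and invalidate all other copies.

    `caches` maps a core id to a dict of {address: value}.
    Returns the number of invalidation messages sent.
    """
    messages = 0
    for core, cache in caches.items():
        if core != writer and addr in cache:
            del cache[addr]  # copy flagged invalid (modelled here as removal)
            messages += 1
    caches[writer][addr] = value
    return messages


def write_update(caches, writer, addr, value):
    """Write `value` to `addr` and push the new value to every other copy.

    Returns the number of update messages sent.
    """
    messages = 0
    for core, cache in caches.items():
        if core != writer and addr in cache:
            cache[addr] = value  # remote copy refreshed with the new value
            messages += 1
    caches[writer][addr] = value
    return messages


# Hypothetical 4-core system in which cores 0, 1 and 2 hold a copy of X.
caches = {0: {"X": 1}, 1: {"X": 1}, 2: {"X": 1}, 3: {}}
sent = write_invalidate(caches, writer=0, addr="X", value=2)
```

In this toy model the number of messages per write equals the number of other caches holding a copy, which is the sense in which the signaling grows with the number of cores sharing the data.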
In their road map to scalable on-chip cache coherence, set out in the article entitled “Why on-chip cache coherence is here to stay,” published in Communications of the ACM, vol. 55, pp. 78-89, July 2012, Martin et al. advocate that hierarchical and clustered design techniques provide a natural methodology for future scalable systems to overcome two main scalability problems of coherence: storage and traffic. Storage is drastically reduced by requiring the last-level cache to track only the clusters, not the individual cores inside each cluster. Global traffic is also reduced since portions of coherence transactions are handled inside the clusters, thus eliminating inter-cluster communication for those portions. As a direct result of intra-cluster locality, the last-level cache sends only a single invalidation message to a cluster and receives only a single acknowledgment message from that cluster each time a data block needs to be invalidated in all the cores inside that cluster.
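The storage and traffic savings can be made concrete with a small calculation. The sketch below assumes a full sharer bit-vector per last-level cache entry and a hypothetical system of 64 cores grouped into clusters of 8; neither the directory format nor these sizes come from the cited article, and the function names are illustrative only.

```python
def sharer_bits_per_entry(num_cores, cluster_size=1):
    """Bits in a full sharer bit-vector when the LLC tracks groups of `cluster_size` cores."""
    return -(-num_cores // cluster_size)  # ceiling division


def llc_invalidation_messages(sharer_cores, cluster_size):
    """Invalidations sent by the LLC: one per cluster that contains at least one sharer.

    The same number of acknowledgments comes back, one per cluster.
    """
    return len({core // cluster_size for core in sharer_cores})


flat_bits = sharer_bits_per_entry(64)          # LLC tracks every core individually
clustered_bits = sharer_bits_per_entry(64, 8)  # LLC tracks clusters only

sharers = [0, 1, 2, 9, 10]                     # cores currently holding the block
flat_msgs = llc_invalidation_messages(sharers, cluster_size=1)
clustered_msgs = llc_invalidation_messages(sharers, cluster_size=8)
```

Under these assumptions the per-entry tracking state shrinks from 64 bits to 8 bits, and the five sharers, which fall into two clusters, cost the last-level cache two invalidations instead of five. The remaining invalidations still occur, but inside the clusters.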
Despite the arguments in support of clustered cache hierarchies, there are also obstacles to overcome as a prerequisite for their wide adoption by the industry. The principal obstacle is the complexity and cost of the coherence that must be implemented. For example, a hierarchical, invalidation-based, MOESI directory protocol has a very high number of states, mainly in the intermediate levels of the hierarchy. This high number of states is the result of the interplay between invalidation-based, directory coherence and clustering.
For example, invalidation-based, directory coherence must fundamentally perform two functions:
1. Invalidation upon write: upon a write miss, invalidate all other sharers.
2. Indirection and downgrade: upon a read miss, find the latest written value and downgrade the writer.
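The two functions listed above can be sketched for the flat (non-hierarchical) case as follows. This is a minimal illustration that assumes only stable states, with no transient states, races, or data movement; the class `FlatDirectory` and its fields are hypothetical names chosen for the example.

```python
class FlatDirectory:
    """Toy flat directory that tracks, per address, either one writer or a set of readers."""

    def __init__(self):
        self.owner = {}    # address -> core holding the block with write permission
        self.sharers = {}  # address -> set of cores holding read-only copies
        self.log = []      # coherence messages issued, in order

    def write_miss(self, core, addr):
        # Function 1: invalidate all other sharers (and any previous writer).
        victims = self.sharers.pop(addr, set()) - {core}
        previous = self.owner.get(addr)
        if previous is not None and previous != core:
            victims.add(previous)
        for victim in sorted(victims):
            self.log.append(("invalidate", victim, addr))
        self.owner[addr] = core

    def read_miss(self, core, addr):
        # Function 2: find the latest writer through the directory and downgrade it.
        writer = self.owner.pop(addr, None)
        readers = self.sharers.setdefault(addr, set())
        if writer is not None:
            self.log.append(("downgrade", writer, addr))
            readers.add(writer)  # the former writer keeps a read-only copy
        readers.add(core)


directory = FlatDirectory()
directory.read_miss(0, "X")
directory.read_miss(1, "X")
directory.write_miss(2, "X")  # invalidates cores 0 and 1
directory.read_miss(3, "X")   # downgrades core 2
```

After every call in this sketch an address has either a single entry in `owner` or a set in `sharers`, never both.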
These two functions enforce the Single Writer Multiple Reader invariant and ensure that written values are propagated correctly. The complexity of a flat (non-hierarchical) directory providing this functionality is well understood and, although there is ample implementation experience, there are also significant advantages in simplifying even this case. In the case of a hierarchical clustered cache architecture, directory-based coherence becomes significantly more complex: it must also be performed hierarchically. A clustered cache hierarchy is handicapped if coherence is not implemented using a hierarchical directory and a hierarchical (tree) protocol. A single flat directory at the root of the hierarchy (e.g., the last-level cache or LLC) simply negates the scalability of the whole approach and proves problematic in handling caching in intermediate levels between the root (LLC) and the leaves (L1s).
Thus, both the invalidation and the indirection/downgrade functions have to be performed hierarchically. This means that intermediate nodes must have the ability to simultaneously behave both as root caches/directories (i.e., send invalidations, collect acknowledgements, indirect requests, as does the LLC) and as leaf caches (i.e., respond to invalidations and/or downgrades, as do the L1s). Moreover, one personality (leaf or root) can invoke the other recursively. For example, invalidations treat nodes in intermediate levels as leaf nodes to be invalidated, but also cause them to behave as root nodes initiating new invalidations in their sub-cluster (similarly for downgrade requests). It is this dual behavior and the resulting cross-product of the states of the two personalities (root and leaf) in intermediate levels that increases the implementation complexity to prohibitive levels. Verification becomes inordinately costly and time to market may be dangerously compromised.
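The recursive interplay of the two personalities can be sketched as follows. This is a deliberately simplified model that assumes a three-level tree (LLC, two L2 clusters, four L1s), a single data block, and a boolean `has_copy` flag per node; the class `Node` and its method names are hypothetical, and the sketch omits the transient states that account for most of the complexity in a real protocol.

```python
class Node:
    """A cache in the hierarchy; intermediate nodes have children and a parent."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.has_copy = False  # whether this node currently holds the block

    def send_invalidations(self, trace):
        """Root personality: invalidate each child holding a copy and collect its ack."""
        for child in self.children:
            if child.has_copy:
                trace.append((self.name, "inv", child.name))
                child.receive_invalidation(trace)
                trace.append((child.name, "ack", self.name))

    def receive_invalidation(self, trace):
        """Leaf personality: give up the local copy.

        Before acknowledging, the node must act as root for its own sub-cluster,
        so the leaf personality invokes the root personality recursively.
        """
        self.send_invalidations(trace)
        self.has_copy = False


# Hypothetical hierarchy: the block is cached in cluster L2_a and both of its L1s.
l1 = [Node(f"L1_{i}") for i in range(4)]
l2_a = Node("L2_a", l1[:2])
l2_b = Node("L2_b", l1[2:])
llc = Node("LLC", [l2_a, l2_b])
for node in (l2_a, l1[0], l1[1]):
    node.has_copy = True

trace = []
llc.send_invalidations(trace)
```

In this run the LLC sends one invalidation and receives one acknowledgment, while `L2_a` handles the two L1 invalidations inside its cluster. Even in this reduced form, the intermediate node has to be reasoned about in both roles at once.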
Accordingly, it would be desirable to provide systems and methods that avoid the afore-described problems and drawbacks associated with the handling of coherence in systems employing clusters of cores and caches.