In many multiprocessor systems, memory devices are organized in hierarchies including main memory and one or more levels of cache memory. Data can reside in one or more of the cache levels and/or main memory. Cache coherence protocols are used in multiprocessor systems to address the potential situation where not all of the processors see the same data value for a given memory location.
Memory systems are said to be coherent if they see memory accesses to a single data location in order. This means that if a write access is performed to data location X, and then a read access is performed to the same data location X, the memory hierarchy should return the value most recently written to X, regardless of which processor performs the read and write and how many copies of X are present in the memory hierarchy. Likewise, coherency also typically requires that writes be performed in a serialized manner such that each processor sees those write accesses in the same order.
There are various types of cache coherency protocols and mechanisms. For example, “explicit invalidation” refers to one mechanism used by cache coherence protocols wherein when a processor writes to a particular data location in a cache then all of the other caches which contain a copy of that data are flagged as invalid by sending explicit invalidation messages. An alternative mechanism is updating wherein when a processor writes to a particular data location in a cache, then all of the other caches which contain a copy of that data are updated with the new value. Both of these cache coherence mechanisms thus require a significant amount of signaling, which scales with the number of cores (or threads) which are operating in a given data processing system. Accordingly, these various cache protocols and mechanisms are known to have their own strengths and weaknesses, and research continues into improving cache coherency protocols with an eye toward maintaining (or improving) performance while reducing costs (e.g., energy consumption) associated with coherency traffic.
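By way of illustration only, the two mechanisms described above can be sketched with a toy model in Python. The class `ToySystem`, the function `run`, the policy labels, the write-through memory and the address `0x40` are all assumptions made for this sketch and do not describe any particular protocol implementation. The sketch counts one coherence message for every remote cache that holds a copy of the written location.

```python
from dataclasses import dataclass, field


@dataclass
class ToySystem:
    """Toy model (hypothetical): one private cache per core plus shared memory."""
    num_cores: int
    policy: str  # "invalidate" or "update"; labels chosen for this sketch
    memory: dict = field(default_factory=dict)
    caches: list = field(default_factory=list)
    messages: int = 0  # coherence messages sent to other caches

    def __post_init__(self):
        if self.policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.caches = [dict() for _ in range(self.num_cores)]

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:  # miss: fetch the value from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, core, addr, value):
        # Assumption: write-through, so memory always holds the latest value.
        self.memory[addr] = value
        self.caches[core][addr] = value
        for other, cache in enumerate(self.caches):
            if other != core and addr in cache:
                self.messages += 1  # one message per remote sharer
                if self.policy == "invalidate":
                    del cache[addr]      # sharer must re-fetch on its next read
                else:
                    cache[addr] = value  # sharer receives the new value


def run(policy, num_cores=4):
    """All cores read one location, core 0 writes it, then all cores re-read."""
    system = ToySystem(num_cores, policy)
    for core in range(num_cores):
        system.read(core, 0x40)
    system.write(0, 0x40, 7)
    values = [system.read(core, 0x40) for core in range(num_cores)]
    return values, system.messages
```

In this model, `run("invalidate")` and `run("update")` both return `([7, 7, 7, 7], 3)`: every core observes the new value, and the single write costs one message per remote sharer, so the signaling grows with the number of cores holding a copy.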
For example, a number of proposals have recently been set forth which aim to simplify coherence by relying on data-race-free semantics and on self-invalidation to eliminate explicit invalidation traffic and the need to track readers at the directory. The motivation for simplifying coherence has been established in numerous articles, some of which are mentioned herein. For example, with the addition of self-downgrade, the directory can be eliminated, see, e.g., A. Ros and S. Kaxiras, “Complexity-effective multicore coherence,” in 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), 2012, and virtual cache coherence becomes feasible at low cost, without reverse translation, see, e.g., S. Kaxiras and A. Ros, “A new perspective for efficient virtual-cache coherence,” in 40th International Symposium on Computer Architecture (ISCA), 2013. Significant savings in area and energy consumption, without sacrificing performance, have also been demonstrated. Additional benefits regarding ease-of-verification, scalability, time-to-market, etc., are possible as a result of simplifying rather than complicating such fundamental architectural constructs as coherence.
In self-invalidation cache coherence protocols, writes on data are not explicitly signaled to sharers as is the case with explicit invalidation cache coherence protocols. Instead, a processor automatically invalidates its locally stored cache copy of the data. However, data races throw such self-invalidation protocols into disarray, producing executions that are not sequentially consistent, see, e.g., A. R. Lebeck and D. A. Wood, “Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors,” in 22nd International Symposium on Computer Architecture (ISCA), 1995. All such proposals seen thus far offer sequential consistency for data-race-free (DRF) programs, see, e.g., S. V. Adve and M. D. Hill, “Weak ordering—a new definition,” in 17th International Symposium on Computer Architecture, 1990.
Data-race-free semantics require that conflicting accesses (e.g., a read and a write to the same address from different cores or processors) must be separated by synchronization (perhaps transitive over a set of threads). Self-invalidation is therefore initiated on synchronization.
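As a minimal sketch only, self-invalidation initiated on synchronization can be modeled in Python as follows. The class `SelfInvalidatingCore`, its `acquire` and `release` methods, the function `demo`, and the choice to write dirty data back to the shared cache on release (a form of self-downgrade) are assumptions made for this illustration rather than features of any cited proposal. Note that no message is ever sent from one private cache to another.

```python
class SelfInvalidatingCore:
    """Toy core (hypothetical) with a private cache in front of a shared LLC."""

    def __init__(self, llc):
        self.llc = llc      # shared last-level cache, modeled as a dict
        self.cache = {}     # private cache: address -> value
        self.dirty = set()  # addresses written locally since the last release

    def read(self, addr):
        if addr not in self.cache:  # miss: fetch from the shared LLC
            self.cache[addr] = self.llc.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def release(self):
        # Assumed self-downgrade: make local writes visible in the LLC.
        for addr in self.dirty:
            self.llc[addr] = self.cache[addr]
        self.dirty.clear()

    def acquire(self):
        # Self-invalidation: drop clean copies, which may be stale.
        # Dirty lines are kept so that local writes are not lost.
        self.cache = {a: v for a, v in self.cache.items() if a in self.dirty}


def demo():
    llc = {}
    producer = SelfInvalidatingCore(llc)
    consumer = SelfInvalidatingCore(llc)
    first = consumer.read(0x80)  # consumer caches the initial value 0
    producer.write(0x80, 42)
    producer.release()           # producer-side synchronization
    stale = consumer.read(0x80)  # no consumer-side synchronization yet
    consumer.acquire()           # consumer-side synchronization
    fresh = consumer.read(0x80)
    return first, stale, fresh
```

Here `demo()` returns `(0, 0, 42)`: the consumer keeps reading its old local copy until it synchronizes, which is acceptable only because data-race-free programs separate the conflicting accesses by such synchronization.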
There are situations where explicit invalidation may be preferred over self-invalidation. For instance, spin-waiting, also known as busy-waiting, which involves checking to see if a lock is available, can be performed more efficiently with explicit invalidations and local spinning on a cached copy, rather than repeatedly self-invalidating and re-fetching. While self-invalidation works well for race-free data, it shows an inherent weakness when it comes to spin-waiting. Entering a critical section, or just spin-waiting for change of state, requires repeated self-invalidation of the lock or flag variable. Herein lies the problem: spin loops cannot spin on a local copy of the synchronization variable which would be explicitly invalidated and re-fetched only with the writing of a new value in write-invalidate protocols. Repeated self-invalidation in local caches leads to excessive traffic to the shared last-level cache (LLC) in the system, wasting bandwidth and/or energy. In the text below, the shared LLC is also sometimes referred to as a “global cache” or a “shared cache”.
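The difference in traffic can be illustrated with a small Python sketch, offered under stated assumptions: a single waiter polls a flag, the writer sets the flag after a fixed number of polls, and the hypothetical function `llc_accesses_while_spinning` counts how often the waiter must go to the shared LLC. Real systems have many waiters and real timing, which this model does not capture.

```python
def llc_accesses_while_spinning(polls_before_write, self_invalidate):
    """Count LLC accesses made by one waiter polling a flag until it is set.

    polls_before_write: polls the waiter completes before the writer sets
                        the flag (an assumption of this toy model).
    self_invalidate:    True models a self-invalidation protocol, False
                        models explicit invalidation with local spinning.
    """
    llc_flag = 0        # value of the flag held in the shared LLC
    cached = None       # waiter's private copy; None means invalid
    llc_accesses = 0
    polls = 0
    while True:
        if polls == polls_before_write:
            llc_flag = 1           # the writer sets the flag
            if not self_invalidate:
                cached = None      # explicit invalidation reaches the waiter
        if self_invalidate:
            cached = None          # waiter discards its copy before each poll
        if cached is None:         # miss: re-fetch the flag from the LLC
            cached = llc_flag
            llc_accesses += 1
        polls += 1
        if cached == 1:
            return llc_accesses
```

With `polls_before_write=1000`, the function returns `1001` when `self_invalidate` is `True` and `2` when it is `False`: under self-invalidation every poll reaches the LLC, whereas with explicit invalidation the waiter fetches the flag once, spins locally, and re-fetches only after the write.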
The solutions that have been proposed to this problem with self-invalidation protocols thus far are costly. For locks, they involve some form of hardware queuing either with a blocking bit in the LLC cache lines and request queuing in the LLC controller when this bit is set, or with a full-blown hardware implementation of queue locking, see, e.g., J. R. Goodman, M. K. Vernon, and P. J. Woest, “Efficient synchronization primitives for large-scale cache-coherent multiprocessors,” ACM, 1989, vol. 17, no. 2, and H. Sung, R. Komuravelli, and S. V. Adve, “DeNovoND: Efficient hardware support for disciplined non-determinism,” in 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013. The cost and complexity of these proposals are not trivial. Further, they tie the lock algorithm to the specifics of the hardware implementation (so the lock algorithm inherits, for better or worse, whatever fairness, starvation, live-lock properties, etc. are offered by the hardware mechanism).
One option is to consider reverting to explicit invalidation for a small set of addresses, namely spin variables. However, explicit invalidations are unsolicited and unanticipated, giving rise to a number of drawbacks that make them unappealing. Because they are unanticipated, explicit invalidations cause significant protocol state explosion to resolve protocol races. Because they are unsolicited, explicit invalidations break the mold of a simple request-response protocol, meaning that they cannot be used for virtual caches without reverse translation.
Accordingly, it would be desirable to provide systems and methods that avoid the afore-described problems and drawbacks associated with the handling of spin waiting and other event monitoring situations without using explicit invalidations as part of the event monitoring mechanism.