Users of computer systems constantly demand improved performance, and designers of circuits and systems respond with a variety of techniques to speed calculations. Two techniques that have had good success are caching and multiprocessing.
Caching is the practice of storing a copy of data in a location from which it can be retrieved more quickly than by reference to the place from which the data was copied. For example, information may be stored in the main memory of a system with a copy cached in a processor cache, because the processor can usually access information in its internal cache faster than it can obtain the same information from main memory. In some systems, more than one level of cache may be provided, with each level permitting access that is improved in some way over outer levels. The cache that is furthest from the processor in the cache hierarchy, or closest to the main memory, is called the “last level cache.” The closer a cache is to the processor, the smaller it tends to be. For example, a central processing unit (“CPU”) may have an innermost level one (“L-1”) cache internal to the processor, and larger, slower level two (“L-2”) and level three (“L-3”) caches fabricated on the same die. In this example, the outer L-3 cache is the last level cache.
All caching schemes must take precautions to ensure that the cached copies are consistent with the original data; that is, they must prevent the use of old, outdated, or “stale” cached copies when the original data has changed.
A multiprocessor system has two or more processors that operate independently, but share some memory and other resources. Some individual processors add another level of multiprocessing by operating on two or more separate instruction streams within each “core” of the processor; this is commonly called “hyper-threading.” Each processor in a multiprocessor system must provide for synchronization to manage contention for, and to prevent corruption of, shared resources.
When caching is combined with multiprocessing, for example in a multiprocessor system where some processors include an internal cache memory, the normal problem of ensuring consistency between a processor's cache and the contents of main memory is complicated by the requirement that all processors maintain a consistent view of shared data in main memory. This problem has been addressed by a device known as a “snoop filter,” a performance-enhancing feature that helps reduce unnecessary snoops onto remote front-side buses (“FSBs”). The snoop filter resides logically between the processors and the shared memory and monitors the operations of the processors to maintain a database of memory locations whose contents may be held in a cache of one or more processors.
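The tracking role of a snoop filter described above can be illustrated with a minimal sketch. This is a toy software model, not a description of any real hardware: the class name `SnoopFilter`, its methods, and the integer processor IDs are all hypothetical and chosen only for illustration. It records which processors may hold each address and reports which other processors would need to be snooped.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy model: maps a memory address to the processors that may cache it.

    All names here are illustrative assumptions, not a real interface.
    """

    def __init__(self):
        # address -> set of processor IDs that may hold a cached copy
        self.sharers = defaultdict(set)

    def record_fill(self, cpu, addr):
        """Note that processor `cpu` has loaded `addr` into its cache."""
        self.sharers[addr].add(cpu)

    def snoop_targets(self, cpu, addr):
        """Return the processors, other than the requester, that may hold `addr`."""
        return self.sharers.get(addr, set()) - {cpu}

    def record_write(self, cpu, addr):
        """Processor `cpu` writes `addr`: return the processors whose copies
        must be invalidated, and leave the writer as the only tracked holder."""
        targets = self.snoop_targets(cpu, addr)
        self.sharers[addr] = {cpu}
        return targets
```

In this model, an address with no tracked holders yields an empty target set, which corresponds to the case where the filter lets the system skip a snoop entirely.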
Snoop filter operations are critical to the correct and efficient operation of a multiprocessor system. If a snoop filter fails to detect that a processor has cached a copy of certain data, then it is possible for that processor to operate on stale data (with potentially disastrous results). On the other hand, a snoop filter that tracks many stale entries for cache lines that are no longer present in any processor's cache will rob the system of performance improvements that the cache could have provided.
Current snoop filters operate by maintaining a coherent directory relating shared memory addresses to the one or more processors in the system that may have cached data at those addresses. Since this directory is usually of fixed size, its entries are a limited resource for which the processors may contend. In particular, when one or more of the processors in a system are engaged in memory-intensive operations that frequently cause new data to be loaded into the processors' caches, the snoop filter can quickly become full. Once it is full, each new cache fill may require the snoop filter to evict an existing entry so that it can store information about the new cache entry. When an entry is evicted, the snoop filter sends a “back-invalidation” signal to all connected processors, causing them to evict any data from the old address. If any of the processors were still using that data, they will have to reload it before continuing. These cache reloads consume front-side bus bandwidth and may cause additional snoop filter entry evictions with their associated back-invalidation signals. In extreme cases, the system can begin thrashing: most bus cycles and processing time are consumed by cache invalidations and subsequent reload operations.
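The eviction and back-invalidation behavior described above can be sketched as a small simulation. This is a simplified model under stated assumptions: the class `BoundedSnoopFilter`, its fields, and the oldest-entry-first victim choice are hypothetical; the text does not specify a replacement policy, and processor caches are modeled here as plain sets of addresses.

```python
from collections import OrderedDict


class BoundedSnoopFilter:
    """Toy fixed-size directory with back-invalidation on eviction.

    Names and the least-recently-filled victim policy are assumptions
    made for illustration only.
    """

    def __init__(self, capacity, num_cpus):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> set of processor IDs
        self.caches = [set() for _ in range(num_cpus)]  # modeled processor caches
        self.back_invalidations = 0

    def fill(self, cpu, addr):
        """Processor `cpu` loads `addr`; evict a directory entry if full."""
        if addr in self.entries:
            self.entries.move_to_end(addr)  # mark as recently used
        else:
            if len(self.entries) >= self.capacity:
                self._evict()
            self.entries[addr] = set()
        self.entries[addr].add(cpu)
        self.caches[cpu].add(addr)

    def _evict(self):
        """Drop the oldest entry and back-invalidate it in every cache."""
        victim, _ = self.entries.popitem(last=False)
        for cache in self.caches:
            cache.discard(victim)  # signal goes to all connected processors
        self.back_invalidations += 1
```

With a capacity of two entries, a third distinct fill forces an eviction, and a later reload of the evicted address forces another. Repeating that pattern is the thrashing scenario: each reload displaces data that some processor may still need.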