Multicore processors continue to provide a hardware coherent memory space to facilitate effective sharing across cores. As the number of cores on a chip increases with improvements in technology, implementing coherence in a scalable manner becomes increasingly challenging. Snoopy and broadcast protocols forward coherence messages to all processors in the system and are bandwidth intensive. They also have inherent limitations in both performance and energy, and it is unlikely that they will be able to scale effectively to large core counts.
Directory-based protocols are able to support more scalable coherence by associating information about sharer cores with every cache line. However, as the number of cores and cache sizes increase, the directory itself adds significant area and energy overheads.
The conceptually simplest approach is to adopt a full map sharer directory and associate a P-bit vector with every cache line, where P is the number of processors. Unfortunately, this makes the directory size dependent on both the number of shared cache lines (M) and the number of processors, resulting in a directory size that is O(M*P).
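The O(M*P) growth can be illustrated with a short calculation. The following is a minimal sketch only; the 64-byte line size, the 16 MiB shared cache capacity, and the function name are hypothetical values chosen for illustration and are not taken from any particular design.

```python
def full_map_directory_bits(num_lines: int, num_cores: int) -> int:
    """Total sharer-vector storage, in bits, for a full map directory:
    one P-bit vector for each of the M cache lines."""
    return num_lines * num_cores


# Hypothetical configuration (assumption, for illustration only).
LINE_BYTES = 64
CACHE_BYTES = 16 * 1024 * 1024
M = CACHE_BYTES // LINE_BYTES  # number of cache lines

for P in (16, 64, 256):
    bits = full_map_directory_bits(M, P)
    # Directory bits stored per bit of cached data.
    overhead = P / (LINE_BYTES * 8)
    print(f"P={P}: {bits // 8 // 1024} KiB of sharer vectors, "
          f"{overhead:.1%} of data storage")
```

Under these assumed parameters the sharer vectors alone grow from a few percent of the data array at 16 cores to half of it at 256 cores.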
FIG. 1 depicts a prior art baseline tiled multicore architecture with an L2 full map, in which a sharer vector is associated with each cache line. As shown in FIG. 1, each tile in the multicore consists of a processor core, private L1 caches (both I and D), and a bank of the globally-shared last-level L2 cache. Coherence at the L1 level is maintained using an invalidation-based directory protocol, and directory entries are maintained at the home L2 bank of a cache line.
Full bit map directories are an attractive approach that was first proposed for multiprocessors but can be extended to maintain coherence in multicores with an inclusive shared cache. The sharing cores are represented as a bit-vector associated with each cache block, with each bit representing whether the corresponding core has a copy of the block. Sharer information is accessed in parallel with the data.
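A full bit map entry of this kind might be modeled as follows. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, coherence states and message timing are not modeled, and only the sharer bookkeeping for an invalidation on a write is shown.

```python
class FullMapEntry:
    """Sharer bit-vector for one cache block (hypothetical model)."""

    def __init__(self, num_cores: int):
        self.num_cores = num_cores
        self.sharers = 0  # bit i set => core i holds a copy of the block

    def add_sharer(self, core: int) -> None:
        self.sharers |= 1 << core

    def remove_sharer(self, core: int) -> None:
        self.sharers &= ~(1 << core)

    def sharer_list(self) -> list:
        return [c for c in range(self.num_cores) if (self.sharers >> c) & 1]

    def write(self, writer: int) -> list:
        """Return the cores that must be invalidated when `writer` writes
        the block; afterwards the writer is the only recorded holder."""
        targets = [c for c in self.sharer_list() if c != writer]
        self.sharers = 1 << writer
        return targets


entry = FullMapEntry(num_cores=8)
for core in (1, 3, 6):
    entry.add_sharer(core)
invalidate = entry.write(writer=3)  # cores 1 and 6 receive invalidations
```

Because the vector records each sharer exactly, invalidations go only to cores that actually hold the block.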
The shadow tag approach, which is used in many current processors, requires a highly associative and energy-intensive lookup operation. Tagless lookup was recently proposed to optimize the shadow tag approach by compressing the replicated L1 cache tags: it uses a set of Bloom filters to concisely summarize the tags in each cache set. The energy-intensive associative lookup needed by shadow tags is thus replaced with Bloom filter tests.
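The idea of replacing an associative tag lookup with a Bloom filter test can be sketched as below. This is a generic Bloom filter, not the specific tagless design referred to above; the filter size, the number of hash functions, the use of SHA-256 as a hash, and all names are assumptions made for illustration. Removal of tags on eviction is not modeled.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter summarizing the tags held in one cache set."""

    def __init__(self, num_bits: int = 64, num_hashes: int = 2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, tag: int):
        # Derive num_hashes bit positions from the tag (hypothetical hashing).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{tag}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, tag: int) -> None:
        for pos in self._positions(tag):
            self.bits |= 1 << pos

    def may_contain(self, tag: int) -> bool:
        # True may be a false positive; False is always exact.
        return all((self.bits >> pos) & 1 for pos in self._positions(tag))


# One filter per core for a given cache set (4 cores assumed).
filters = [BloomFilter() for _ in range(4)]
filters[2].insert(0x1A2B)
possible_sharers = [c for c, f in enumerate(filters) if f.may_contain(0x1A2B)]
```

A filter test can report a core that does not actually hold the block, but it never misses a core that does, so the resulting sharer set is conservative.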
Various other approaches have been proposed to reduce the area overheads of a full bit map directory, including the use of a directory cache, a compressed sharer vector, and pointers. Directory caches restrict the blocks for which precise sharing information can be maintained simultaneously. Compressed sharer vectors fix the level of imprecision at design time; all cache lines suffer from imprecision. Pointers incur a significant penalty, for example, due to the need to revert to either software or broadcast mode when the number of sharers exceeds the number of pointers.
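The overflow penalty of the pointer approach can be seen in a small model. The sketch below assumes a fallback to broadcast mode (one of the two fallbacks mentioned above); the class name, the pointer count, and the core count are hypothetical.

```python
class LimitedPointerEntry:
    """Directory entry holding a fixed number of sharer pointers
    (hypothetical model; falls back to broadcast on overflow)."""

    def __init__(self, num_pointers: int):
        self.num_pointers = num_pointers
        self.pointers = []      # core IDs tracked precisely
        self.broadcast = False  # set once the pointers overflow

    def add_sharer(self, core: int) -> None:
        if self.broadcast or core in self.pointers:
            return
        if len(self.pointers) < self.num_pointers:
            self.pointers.append(core)
        else:
            # More sharers than pointers: precise information is lost.
            self.pointers.clear()
            self.broadcast = True

    def invalidation_targets(self, num_cores: int, writer: int) -> list:
        candidates = range(num_cores) if self.broadcast else self.pointers
        return [c for c in candidates if c != writer]


entry = LimitedPointerEntry(num_pointers=2)
entry.add_sharer(0)
entry.add_sharer(5)
precise = entry.invalidation_targets(num_cores=8, writer=0)    # one target
entry.add_sharer(7)                                             # overflow
imprecise = entry.invalidation_targets(num_cores=8, writer=0)  # all others
```

With two pointers and three sharers, a single write triggers invalidations to every other core in this model, even though only two of them hold the block.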
What is needed is a method and system, such as a directory table, that takes advantage of the observation that many memory locations in an application are accessed by the same set of processors, resulting in a few sharing patterns that occur frequently, and that represents the subset of sharing patterns recognized. What is also needed is a method and system that decouples the sharing pattern from each cache line (e.g., does not require a one-to-one correspondence between directory entries and cache lines) and holds the sharing patterns in a separate directory table. What is also needed is for multiple cache lines that have the same sharing pattern to point to a common entry in the directory table. For example, with the directory table storing the sharing patterns, each cache line includes a pointer whose size depends on the number of entries in the directory.
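One possible way to picture such a decoupled directory table is sketched below. This is an illustrative model only, not a definitive implementation: the class and method names are hypothetical, the pointer is assumed to be a plain index into the table (so an N-entry table needs about log2(N) bits per cache line), and handling of a full table is deliberately left unmodeled.

```python
class PatternDirectory:
    """Table of distinct sharing patterns shared by many cache lines
    (hypothetical model)."""

    def __init__(self, num_entries: int):
        self.num_entries = num_entries
        self.patterns = []   # table of distinct sharer bit-vectors
        self.index_of = {}   # sharing pattern -> table index
        self.line_ptr = {}   # cache line address -> table index

    def pointer_bits(self) -> int:
        # Bits needed per cache line to index the table (assumed encoding).
        return max(1, (self.num_entries - 1).bit_length())

    def set_sharers(self, line_addr: int, pattern: int) -> None:
        idx = self.index_of.get(pattern)
        if idx is None:
            if len(self.patterns) >= self.num_entries:
                raise RuntimeError("table full; replacement is not modeled")
            idx = len(self.patterns)
            self.patterns.append(pattern)
            self.index_of[pattern] = idx
        self.line_ptr[line_addr] = idx

    def sharers(self, line_addr: int) -> int:
        return self.patterns[self.line_ptr[line_addr]]


directory = PatternDirectory(num_entries=16)
directory.set_sharers(0x100, 0b0101)  # cores 0 and 2
directory.set_sharers(0x140, 0b0101)  # same pattern, same table entry
directory.set_sharers(0x180, 0b0011)  # cores 0 and 1
```

In this model the two lines with identical sharers consume one table entry between them, and the per-line cost is the pointer width rather than a full P-bit vector.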