A user's view of a shared-memory system is elegantly simple: all processors read and modify data in a single shared store. From the user's point of view, this makes shared-memory multiprocessors preferable to message-passing multicomputers. Most shared-memory multiprocessors accelerate memory accesses using per-processor caches, which a cache coherence protocol usually keeps transparent to software. There are many different approaches to cache architecture, such as directory-based coherence protocols (cc-NUMA multiprocessors), which allow a large number of processors to share a single global address space over physically distributed memory, and snooping-based protocols (SMP designs). The main difficulty in such designs is implementing the cache coherence protocol efficiently enough to minimize the usually long L2 miss latencies.
Snooping and directory protocols are the two dominant classes of cache coherence protocols for hardware shared-memory multiprocessors. Snooping systems (such as the Sun UE1000) use a totally ordered network to broadcast coherence transactions directly to all processors and memory. As a result, they can achieve lower latencies than directory protocols for cache-to-cache transfer misses (and for sharing misses in general). Directory protocols transmit coherence transactions over an arbitrary point-to-point network to the corresponding home directories, which in turn redirect them to the processors caching the line. Consequently, directory systems (such as the SGI Origin 2000) can scale to large configurations, but they can have high unloaded latency because of the overheads of directory indirection and message sequencing.
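The latency difference described above can be illustrated by listing the messages on the critical path of a cache-to-cache transfer miss. The following is a minimal sketch under stated assumptions, not a model of any particular machine: the function names, the node labels, and the per-hop cost `HOP_LATENCY_NS` are hypothetical, and directory lookup time, contention, and acknowledgements are ignored.

```python
HOP_LATENCY_NS = 50  # hypothetical cost of one network traversal


def snooping_miss(requester, owner):
    """Critical-path messages for a cache-to-cache miss under snooping."""
    return [
        (requester, "all", "broadcast request"),  # seen by every cache at once
        (owner, requester, "data"),               # owner replies directly
    ]


def directory_miss(requester, home, owner):
    """Critical-path messages for the same miss under a directory protocol."""
    return [
        (requester, home, "request"),        # indirection through the home node
        (home, owner, "forwarded request"),  # directory redirects to the owner
        (owner, requester, "data"),
    ]


def unloaded_latency(messages, hop_ns=HOP_LATENCY_NS):
    """Idealized latency: one fixed-cost hop per critical-path message."""
    return len(messages) * hop_ns


if __name__ == "__main__":
    print(unloaded_latency(snooping_miss("P0", "P2")))
    print(unloaded_latency(directory_miss("P0", "P1", "P2")))
```

Under these simplifications the directory path has one more hop than the snooping path, which is the indirection overhead mentioned above; the broadcast in the snooping case is what limits its scalability.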
Effective cache management is a critical factor in obtaining optimal application performance, given the growing disparity in clock speeds between the processor and memory combined with an increasing degree of processor instruction-level parallelism. There are two typical approaches to coping with the memory latency penalty: reducing latency and tolerating latency. Techniques for reducing latency include maintaining locality of data references through locality optimizations. Techniques for tolerating latency include buffering and pipelining references, and prefetching under software control through the use of processor-specific features.
Prefetching by loading the next cache line in sequence can be implemented in hardware, software, or a combination of both. The software prefetch approach uses compile-time information to insert memory “touch” instructions, whereas the hardware approach detects memory reference patterns at run-time and automatically triggers memory touches. Much of the published literature on prefetching addresses two issues: (1) reducing the overhead caused by explicit prefetch instructions; and (2) placing prefetch instructions such that data is available when a demand load is issued.
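The effect of next-line prefetching on a sequential access pattern can be sketched with a small simulation. This is an illustrative sketch only: the function name `count_demand_misses`, the 64-byte line size, and the trace are assumptions, the cache is treated as unbounded (no evictions), and every prefetch is assumed to complete before the next access, which sidesteps the placement problem in issue (2).

```python
def count_demand_misses(addresses, line_size=64, next_line_prefetch=False):
    """Count demand misses for a byte-address trace in an idealized cache."""
    cached_lines = set()
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in cached_lines:
            misses += 1              # demand miss: the line was not present
            cached_lines.add(line)
        if next_line_prefetch:
            # "Touch" the next sequential line so a later access finds it.
            cached_lines.add(line + 1)
    return misses


# Hypothetical trace: 8-byte elements read sequentially across 8 cache lines.
trace = range(0, 64 * 8, 8)

if __name__ == "__main__":
    print(count_demand_misses(trace))
    print(count_demand_misses(trace, next_line_prefetch=True))
```

In this idealized setting only the first line incurs a demand miss once prefetching is enabled. A real system would also have to account for the cost of issuing a touch on every access, which is the overhead in issue (1).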
The published literature on prefetching includes the following: Todd C. Mowry, Monica S. Lam and Anoop Gupta, “Design and Evaluation of a Compiler Algorithm for Prefetching,” 1992 Association for Computing Machinery; P. Cao, E. W. Felten, A. R. Karlin, and K. Li, “A Study of Integrated Prefetching and Caching Strategies,” Proceedings of ACM SIGMETRICS'95, pp. 188–197, May 1995; Callahan, D., Kennedy, K., and Porterfield, A., “Software Prefetching,” in Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (April), ACM, New York, 40–52; Klaiber, A. C. and Levy, H. M., “Architecture for Software-Controlled Data Prefetching,” in Proceedings of the 18th Annual International Symposium on Computer Architecture (May 1991), 43–63; and Santhanam Vatsa, “Efficient Explicit Data Prefetching Analysis and Code Generation in a Low-level Optimizer for Inserting Prefetch Instructions into Loops of Applications,” U.S. Pat. No. 5,704,053.
Todd C. Mowry, Monica S. Lam, and Anoop Gupta propose a software prefetch algorithm that identifies a prefetch predicate and the leading reference among multiple references to an array, so that prefetches are issued selectively, and that interacts with other transformations such as cache blocking and software pipelining. It is assumed that the arrays of interest are aligned on cache line boundaries. Santhanam Vatsa uses simple subscript expression analysis and explicit data cache prefetch instruction insertion, integrated with other low-level optimization phases such as loop unrolling, register reassociation and instruction scheduling.
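The idea of a prefetch predicate can be illustrated for the simplest case, a unit-stride loop over an array aligned on a cache line boundary. The sketch below is loosely modeled on that idea and is not the cited authors' algorithm: the function name `prefetch_schedule`, its parameters, and the choice of predicate `i % elems_per_line == 0` are assumptions made for illustration, and leading-reference selection is not covered.

```python
def prefetch_schedule(n_elements, elems_per_line, distance_lines=1):
    """Return (iteration, prefetched_index) pairs for a unit-stride loop.

    Assumes the array starts on a cache line boundary, so element i begins
    a new line exactly when i is a multiple of elems_per_line.
    """
    schedule = []
    for i in range(n_elements):
        # Prefetch predicate: issue a prefetch only on the first access to
        # each line, since later accesses to that line would be redundant.
        if i % elems_per_line == 0:
            target = i + distance_lines * elems_per_line
            if target < n_elements:   # do not prefetch past the array
                schedule.append((i, target))
    return schedule


if __name__ == "__main__":
    # Hypothetical sizes: 32 elements, 8 elements per cache line.
    print(prefetch_schedule(32, 8))
```

With these hypothetical sizes, 3 prefetches are issued instead of 32, one per line boundary that has a following line inside the array. The first line is never prefetched here; a real compiler would typically cover it in a loop prologue.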
Well-known dependence and reuse analyses are detailed in the following references: Michael E. Wolf and Monica S. Lam, “A Data Locality Optimizing Algorithm,” SIGPLAN Notices 26, 6 (June 1991), 30–44, Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation; Dror E. Maydan, John L. Hennessy and Monica S. Lam, “Efficient and Exact Data Dependence Analysis,” Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, Jun. 26–28, 1991; Gina Goff, Ken Kennedy, Chau-Wen Tseng, “Practical Dependence Testing,” Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, Jun. 26–28, 1991; and M. Wolfe, Chau-Wen Tseng, “The Power Test for Data Dependence,” Technical Report, Oregon Graduate Institute of Science and Technology.