1. Field
The disclosed embodiments relate to shared-memory multiprocessor systems. More specifically, the disclosed embodiments relate to an interface that supports targeted-store instructions, which are store instructions that are directed to a specific location (e.g., cache) in a shared-memory multiprocessor system.
2. Related Art
Shared-memory multiprocessor systems are continuing to grow in size, with increases both in the number of cores per chip, and the number of chips in a system. Moreover, there are differences in the number and size of caches, how they are shared (or not), and latencies between various levels of cache within and between chips, and to local and remote memory. Despite these differences, as systems grow, the latency of accessing remote elements (e.g., cache or memory) inherently grows relative to the latency of accessing local elements. That is, systems are increasingly NUMA (Non-Uniform Memory Access), and the NUMA constants (ratios of latencies to access remote and local elements) are growing.
Significant challenges for programmers accompany these changes. Software that has performed acceptably on smaller systems can suffer severe performance degradation when scaled to larger systems, especially due to NUMA effects.
Consider, for example, a hypothetical application running on a single-socket, multi-core system. Suppose the working set of the application is such that it fits comfortably in an on-chip cache (say L2), so that it exhibits good cache locality and performs well. In particular, when one thread accesses a memory location that has recently been modified by another thread, the location is likely to be in the on-chip L2 cache, in which case the access hits in the cache and no off-chip communication is required to satisfy the memory request. Otherwise, the location is stored in a memory that is physically close to the (single) processor chip.
Consider now a larger system with multiple processor sockets. Memory that is located physically close to one processor is necessarily further from others. Similarly, the caches of other processors are physically further away than a processor's own caches. Broadly, systems meeting this description are referred to as NUMA (Non-Uniform Memory Architecture). If the same application is configured now to run on such a system, even though its working set may still fit comfortably in cache, we now have threads running on different chips, and therefore inter-chip communication is required to keep the caches on the multiple chips coherent. In this case, when one thread accesses a memory location that has recently been modified by another, it is likely that the other thread is on a different chip. In this case, if the location is still in a cache near the thread that recently modified it, then it needs to be invalidated or downgraded in that cache, and brought into the cache of the thread performing the subsequent access. Alternatively, the location may no longer be cached; it may be stored at its home memory node, which is likely to be memory other than the memory located physically close to the thread performing the subsequent access.
The first problem in this scenario is obvious: the latency to access a memory location can increase significantly as system sizes grow. Perhaps less obviously, the bandwidth available for coherence and data communication is not growing at the same rate that the number of cores in systems is growing. Therefore, the problem may be further exacerbated when the coherence and memory traffic produced by an application or set of applications approach the bandwidth limitations of the system.
Therefore, techniques for reducing the amount of remote communication required by applications are needed, as well as techniques for reducing the cost—in terms of latency, bandwidth, or both.