Symmetric Multiprocessing (SMP) is a multiprocessor system where two or more identical processors are connected, typically by a bus of some sort, to a single shared main memory. Since all the processors share the same memory, the system appears just like a “regular” desktop to the user. SMP systems allow any processor to work on any task no matter where the data for that task is located in memory. With proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently. Consequently, SMP has many uses in science, industry, and business, where software is specially programmed for multithreaded processing.
In a bus-based system, a number of system components are connected by a single shared data path. To make a bus-based system work efficiently, the system reduces contention for the bus through the effective use of memory caches (e.g., line caches) in the CPU. These caches exploit a concept called locality of reference: a resource that is referenced at one point in time will likely be referenced again in the near future. However, as the number of processors rises, CPU caches fail to provide sufficient reduction in bus contention. Consequently, bus-based SMP systems tend not to comprise large numbers of processors.
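By way of illustration only, the effect of locality of reference on bus traffic can be sketched with a small cache model. The sketch below assumes a least-recently-used replacement policy and a hypothetical function name, count_bus_accesses, neither of which is taken from the systems described above. Each cache miss is counted as one access over the shared bus.

```python
from collections import OrderedDict


def count_bus_accesses(addresses, cache_lines):
    """Count how many references miss the cache and must go over the bus.

    Assumption: a fully associative cache with least-recently-used
    replacement; real CPU caches differ in organization and policy.
    """
    cache = OrderedDict()  # address -> placeholder, ordered oldest to newest
    misses = 0
    for addr in addresses:
        if addr in cache:
            # Hit: served locally, so no bus traffic; mark as recently used.
            cache.move_to_end(addr)
        else:
            # Miss: the block must be fetched over the shared bus.
            misses += 1
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses
```

Under these assumptions, a reference stream that keeps returning to the same few addresses generates bus traffic only on first use, whereas a stream whose working set exceeds the cache generates a bus access on nearly every reference.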
Distributed Shared Memory (DSM) is a multiprocessor system that allows for greater scalability, since the processors in the system are connected by a scalable interconnect, such as an InfiniBand switched fabric communications link, instead of a bus. DSM systems still present a single memory image to the user, but the memory is physically distributed at the hardware level. Typically, each processor has access to a large shared global memory in addition to a limited local memory, which might be used as a component of the large shared global memory and also as a cache for the large shared global memory. Naturally, each processor will access the limited local memory associated with the processor much faster than the large shared global memory associated with other processors. This discrepancy in access time is called non-uniform memory access (NUMA).
A major problem in DSM systems is ensuring that each processor's memory cache is consistent with every other processor's memory cache. Such consistency is called cache coherence. A statement of the sufficient conditions for cache coherence is as follows: (a) a read by a processor, P, to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P; (b) a read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated and no other writes to X occur between the two accesses; and (c) writes to the same location are serialized so that two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors do not read the value of the location as 2 and then later read it as 1.
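Condition (c) can be illustrated with a small checker. The following is a minimal sketch, assuming that every value written to the location is distinct and that the serialized write order is known; the function name respects_write_order is hypothetical. It tests whether the sequence of values one processor reads from the location is compatible with that order.

```python
def respects_write_order(write_order, observed):
    """Return True if `observed` never goes backwards in `write_order`.

    write_order: the values written to one location, in serialized order
                 (assumed distinct for this sketch).
    observed:    the values one processor read from that location, in the
                 order the reads occurred.
    """
    position = {value: i for i, value in enumerate(write_order)}
    last = -1
    for value in observed:
        idx = position[value]
        if idx < last:
            # The processor saw a newer write and then an older one.
            return False
        last = idx
    return True
```

With the write order 1 then 2 from the example above, a processor that reads 1 and then 2 passes the check, while a processor that reads 2 and then 1 fails it.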
Bus sniffing or bus snooping is a technique for maintaining cache coherence which might be used in a distributed system of computer nodes. This technique requires a cache controller in each node to monitor the bus, waiting for broadcasts which might cause the controller to change the state of its cache of a memory block. Typically, the states for a memory block in a cache include “dirty” (or “modified”), “valid” (“owned” or “exclusive”), “shared”, and “invalid”. It will be appreciated that the parenthesized states are often referred to as the states of the MOESI (Modified Owned Exclusive Shared Invalid) coherence protocol. See U.S. Pat. No. 5,706,463. On a read miss by a node (e.g., a request to load data), the node's cache controller broadcasts, via the bus, a request to read a block and the cache controller for the node with a copy of the block in the state “dirty” changes the block's state to “valid” and sends a copy of the block to the requesting node. On a write miss by a node (e.g., a request to store data), the node's cache controller transitions the block into a “valid” state and broadcasts a message, via the bus, to the other cache controllers to invalidate their copies of the block. Once the node has written to the block, the cache controller transitions the block to the state “dirty”. Since bus snooping does not scale well, larger distributed systems tend to use directory-based coherence protocols.
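The read-miss and write-miss transitions described above can be sketched as a small simulation. This is an illustrative model only, not the protocol of any cited reference. The class names Bus and CacheController are hypothetical, each block holds a single value, and the sketch assumes that a requesting node enters the "shared" state after a read miss and that a node holding a block in the "valid" state continues to supply it on later read misses.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    SHARED = "shared"
    VALID = "valid"
    DIRTY = "dirty"


class Bus:
    """Hypothetical shared bus: every broadcast reaches all other controllers."""

    def __init__(self):
        self.controllers = []
        self.memory = {}  # block address -> value held in main memory

    def attach(self, controller):
        self.controllers.append(controller)

    def broadcast_read(self, sender, addr):
        # A snooping cache that owns the block supplies it; else use memory.
        for c in self.controllers:
            if c is not sender:
                data = c.snoop_read(addr)
                if data is not None:
                    return data
        return self.memory.get(addr, 0)

    def broadcast_invalidate(self, sender, addr):
        for c in self.controllers:
            if c is not sender:
                c.snoop_invalidate(addr)


class CacheController:
    def __init__(self, bus):
        self.bus = bus
        self.state = {}  # block address -> State
        self.data = {}   # block address -> cached value
        bus.attach(self)

    def state_of(self, addr):
        return self.state.get(addr, State.INVALID)

    def load(self, addr):
        if self.state_of(addr) is State.INVALID:  # read miss
            self.data[addr] = self.bus.broadcast_read(self, addr)
            self.state[addr] = State.SHARED  # assumption: requester is a sharer
        return self.data[addr]

    def store(self, addr, value):
        if self.state_of(addr) is not State.DIRTY:  # write miss or upgrade
            self.state[addr] = State.VALID
            self.bus.broadcast_invalidate(self, addr)
        self.data[addr] = value
        self.state[addr] = State.DIRTY  # block now differs from memory

    def snoop_read(self, addr):
        current = self.state_of(addr)
        if current is State.DIRTY:
            self.state[addr] = State.VALID  # "dirty" -> "valid" on remote read
            return self.data[addr]
        if current is State.VALID:
            return self.data[addr]  # assumption: the owner keeps supplying data
        return None

    def snoop_invalidate(self, addr):
        if addr in self.state:
            self.state[addr] = State.INVALID
```

In this sketch every load or store that misses results in a broadcast that each attached controller must examine, which illustrates why snooping traffic grows with the number of nodes.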
In directory-based protocols, directories are used to keep track of where data, at the granularity of a cache block, is located on a distributed system's nodes. Every request for data (e.g., a read miss) is sent to a directory, which in turn forwards information to the nodes that have cached that data and these nodes then respond with the data. A similar process is used for invalidations on write misses. In home-based protocols, each cache block has its own home node with a corresponding directory located on that node.
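A directory of this kind can be sketched as a table that maps each cache block to the set of nodes holding a copy. The following is a simplified illustration under stated assumptions: the names Node, Directory, and home_node are hypothetical, a single directory object stands in for the per-home-node directories, and the home node is chosen by a simple modulo of the block number, which is only one possible mapping.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.cache = {}  # block number -> cached value


class Directory:
    """Tracks, per cache block, which nodes currently hold a copy."""

    def __init__(self, nodes):
        self.nodes = nodes    # node id -> Node
        self.sharers = {}     # block number -> set of node ids with a copy
        self.memory = {}      # block number -> value in main memory
        self.messages = []    # log of messages the directory sends

    def read_miss(self, requester_id, block):
        holders = self.sharers.setdefault(block, set())
        if holders:
            # Forward the request to a node that has the block cached.
            supplier = min(holders)
            self.messages.append(("forward", block, supplier))
            value = self.nodes[supplier].cache[block]
        else:
            value = self.memory.get(block, 0)
        self.nodes[requester_id].cache[block] = value
        holders.add(requester_id)
        return value

    def write_miss(self, requester_id, block, value):
        holders = self.sharers.setdefault(block, set())
        # Invalidate every other cached copy before the write proceeds.
        for node_id in sorted(holders - {requester_id}):
            self.nodes[node_id].cache.pop(block, None)
            self.messages.append(("invalidate", block, node_id))
        self.sharers[block] = {requester_id}
        self.nodes[requester_id].cache[block] = value


def home_node(block, num_nodes):
    """Assumed mapping from a block to the node holding its directory entry."""
    return block % num_nodes
```

Unlike the broadcast used in bus snooping, the directory sends messages only to the nodes recorded as holding the block, at the cost of an extra hop through the directory on each miss.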
To maintain cache coherence in larger distributed systems, additional hardware logic (e.g., a chipset) or software is used to implement a coherence protocol, typically directory-based, chosen in accordance with a data consistency model, such as strict consistency. DSM systems that maintain cache coherence are called cache-coherent NUMA (ccNUMA). Of course, directory-based coherence protocols and data consistency models introduce latency into the system, which might severely degrade performance if not properly managed within the overall system design. In this regard, see European Patent Application Ser. No. EP1008940A2 and U.S. Pat. No. 7,107,408, as well as M. E. Acacio, J. González, J. M. García, and J. Duato, Owner Prediction for Accelerating Cache-to-Cache Transfers in a cc-NUMA Architecture (in Proceedings of SC2002).
Advanced Micro Devices has created a server processor, called Opteron, which uses the x86 instruction set and which includes a memory controller as part of the processor, rather than as part of a northbridge or memory controller hub (MCH) in a logic chipset. The Opteron memory controller controls a local main memory for the processor. In some configurations, multiple Opterons can use a cache-coherent HyperTransport (ccHT) bus, which is somewhat scalable, to “gluelessly” share their local main memories with each other, though each processor's access to its own local main memory uses a faster connection. One might think of the multiprocessor Opteron system as a hybrid of DSM and SMP systems, insofar as the Opteron system uses a form of ccNUMA with a bus interconnect.