A symmetric multi-processing (SMP) system contains one or more CPU cell boards. Each cell board contains one or more CPUs, cache, and memory. The cell boards are connected by a ‘system fabric’, typically a set of links that includes one or more crossbar switches. Data can be shared between cell boards (and between CPUs on a single cell board), but a protocol must be followed to maintain cache coherency. Although caches can share data, the same memory address can never have different values in different caches.
A common cache coherency implementation uses directories, called cache-coherency directories, which are associated with each cache. A cache coherency directory records the addresses of all the cache lines, along with the status (e.g., invalid, shared, exclusive) and the location of the line in the system. A bit vector is generally used to represent the cache line location, each bit corresponding to a processor or (in some implementations) a processor bus. Given the information in the cache coherency directory, a protocol is implemented to maintain cache coherency.
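As a minimal sketch of the directory described above, the following Python models one directory entry per cache line, with a status field and a bit vector of holders. The class and method names (`DirectoryEntry`, `CoherencyDirectory`, `add_sharer`, and so on) are hypothetical and chosen only for illustration; a real directory is a hardware structure, and its layout is not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class LineState(Enum):
    """The example states named in the text."""
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


@dataclass
class DirectoryEntry:
    """Status and location of one cache line (hypothetical layout)."""
    state: LineState = LineState.INVALID
    sharers: int = 0  # bit vector: bit i set means processor i holds the line

    def add_sharer(self, cpu: int) -> None:
        self.sharers |= 1 << cpu

    def remove_sharer(self, cpu: int) -> None:
        self.sharers &= ~(1 << cpu)

    def holders(self) -> List[int]:
        """Decode the bit vector into a list of processor indices."""
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]


class CoherencyDirectory:
    """Maps cache-line addresses to their directory entries."""

    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def lookup(self, line_addr: int) -> DirectoryEntry:
        # Lines never seen before start out invalid with no holders.
        return self.entries.setdefault(line_addr, DirectoryEntry())
```

In an implementation that tracks processor buses rather than individual processors, each bit would stand for a bus instead, and the decoding step would be the same.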
In a typical cache-coherency protocol, each cache address has a home directory, and exclusive reads that miss cache anywhere in the system go first to that directory. Unless the missed address is ‘local’ to the requesting cell board, the request must traverse the system fabric, crossing one or more crossbar switches (depending on the configuration of the fabric) to reach the home directory for that address.
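The local-versus-remote distinction can be illustrated with a short sketch. The text does not say how addresses are assigned to home directories, so the cache-line interleaving below, the constants `LINE_SIZE` and `NUM_CELLS`, and the function names are all assumptions made for the example.

```python
LINE_SIZE = 64  # bytes per cache line (assumed value)
NUM_CELLS = 4   # number of cell boards in the system (assumed value)


def home_cell(address: int) -> int:
    """Cell board whose directory is home for this address.

    Assumes consecutive cache lines are interleaved across cell boards;
    a real system may use a different mapping.
    """
    return (address // LINE_SIZE) % NUM_CELLS


def is_local(address: int, requesting_cell: int) -> bool:
    """True if a miss on this address can be served without crossing the fabric."""
    return home_cell(address) == requesting_cell
```

Under these assumptions, a miss on address 64 from cell board 1 is local, while the same miss from cell board 0 must cross the system fabric to reach cell board 1.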
Once the request reaches the home directory, there are three possible cases:
1. No remote copies of the line exist (as determined from the directory). In this case, the home directory fetches the requested line from either local memory or local cache, and sends it across the system fabric to the requesting CPU.
2. Remote shared copies exist. In this case the home directory must send ‘invalidate’ commands to all the nodes that contain copies, and then fetch the requested line from either local memory or local cache, and send it across the system fabric to the requesting node.
3. Remote exclusive copy exists. The home directory fetches that copy, invalidates it in the remote cache, and then sends it to the requesting node.
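The three cases above can be sketched as a single handler that returns the messages the home directory would issue. This is an illustrative model only: the function name `handle_exclusive_read`, the message names, and the representation of a directory entry as a `(state, remote_holders)` tuple are assumptions, and the requester is assumed to hold no copy of the line because it missed cache.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, Set[int]]   # (state, nodes holding remote copies)
Message = Tuple[str, int]      # (action, target node)


def handle_exclusive_read(directory: Dict[int, Entry], line_addr: int,
                          requester: int, home: int) -> List[Message]:
    """Model how a home directory services an exclusive read that missed cache."""
    state, remote = directory.get(line_addr, ("invalid", set()))
    messages: List[Message] = []

    if state == "exclusive" and remote:
        # Case 3: fetch the single remote copy and invalidate it there.
        owner = next(iter(remote))
        messages.append(("fetch_and_invalidate", owner))
    else:
        if state == "shared" and remote:
            # Case 2: invalidate every remote shared copy first.
            for node in sorted(remote):
                messages.append(("invalidate", node))
        # Cases 1 and 2: the line comes from the home's local memory or cache.
        messages.append(("fetch_local", home))

    # In every case the line is finally sent to the requesting node,
    # which becomes the sole (exclusive) holder.
    messages.append(("send_line", requester))
    directory[line_addr] = ("exclusive", {requester})
    return messages
```

For example, with `directory = {0x40: ("shared", {1, 2})}`, a call to `handle_exclusive_read(directory, 0x40, requester=3, home=0)` yields two invalidates, a local fetch, and a send to node 3. Each message to a remote node corresponds to at least one traversal of the system fabric.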
The latency of these requests is dominated by the number of hops, each of which increases the time required to traverse the system fabric. As SMPs grow larger, the number of crossbar switches and the length of the links between nodes increase, which in turn lengthens the average time required to make a hop. Also, some SMP designs keep the directory in main memory rather than in cache, which makes case 3 above take even longer.
A design problem parallels this performance problem: increasing the number of crossbar switches and links reduces hop latency, but the expense of these extra components, and the difficulty of fitting them into a cabinet of a given size and of powering and cooling them, are not trivial.