Recent trends have been for large computer systems to be built not as a single large processor, but from smaller, often modular building blocks, hereafter referred to as nodes, each containing its own processors and memory, and for these nodes to be loosely or tightly coupled to form a large multiprocessor system image. These system structures reduce the total number of independent systems that must be maintained, and allow the flexibility of running a small number of large workloads or a large number of small workloads.
However, in the case of a tightly coupled system (e.g., shared memory systems with a smaller package or available real estate for the nodes), this configuration may compromise the system performance for smaller applications. In a tightly coupled system there is a shared memory model wherein programs or applications can be run on any processor or node in the system. Thus, as a program moves between the nodes, its related data resides in the shared memory of the nodes of the system, and snoops may have to be made across the system to determine the location of the most recent copy of the data. Because the system does not know where the most recent copy of the requested data is, these snoops are broadcast to all of the nodes. Such broadcasting increases the coherency traffic between the nodes and may limit the system performance due to queueing of the requests in the communication links connecting the nodes.
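The cost of broadcast snooping described above can be illustrated with a toy model. The sketch below is not taken from the present application: the function name `broadcast_snoops` is hypothetical, and it assumes, purely for illustration, that every local cache miss sends exactly one snoop to each of the other nodes and that all nodes miss equally often.

```python
def broadcast_snoops(num_nodes: int, misses_per_node: int) -> int:
    """Count snoop messages when every local cache miss is broadcast.

    Illustrative assumptions only: each miss sends one snoop to each of
    the other (num_nodes - 1) nodes, and every node incurs the same
    number of local cache misses.
    """
    if num_nodes < 1 or misses_per_node < 0:
        raise ValueError("need num_nodes >= 1 and misses_per_node >= 0")
    return num_nodes * misses_per_node * (num_nodes - 1)


if __name__ == "__main__":
    # Doubling the node count more than doubles the snoop traffic.
    for n in (2, 4, 8):
        print(n, broadcast_snoops(n, misses_per_node=1000))
```

Under these assumptions the inter-node snoop count grows roughly with the square of the number of nodes, which is one way to see why the communication links may become a queueing bottleneck.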
This is because coherency checking across the memories and caches of larger system structures, in order to find the most recent copy of the requested data, increases the sensitivity to misses in the local cache: each miss drives additional coherency checking traffic across the connections between the nodes.
As the system grows in size, the related increase in coherency checking exacts an increasing performance penalty from an interconnect structure that necessarily has limited capacity and response time due to package restrictions.
This increasing performance penalty, which is related to coherency checking on the interconnects, causes problems with scaling to larger structures and limits the effectiveness of such larger structures.
Prior to the methods and apparatus of the present application, large modular systems had two choices: 1) they could be run as a large single image, with potentially multiple logical partitions running operating system images that potentially span multiple nodes, which required storage coherency checking across the entire system complex and introduced a performance penalty for the checking of all memory accesses across the entire system; or 2) they could be physically or firmly partitioned into inflexible separate operating zones that avoided this storage coherency checking on the system fabric but, as a result, had no access to the memory of any other zone.
Accordingly, there was no in-between option that allowed the flexibility of applications in some zones having access to memory across the complex while others enjoyed the efficiency and speed of local memory access without involving coherency checking across the system fabric, and there was no mechanism for transitioning from one mode to the other in a dynamic, streamlined fashion.
Therefore, it is desirable to provide an apparatus and method for keeping track of where data has been cached and minimizing the number of broadcasts related to coherency checking across the system.
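To make the general idea of tracking cached locations concrete, the following is a minimal sketch and not the apparatus or method of the present application. The class `PresenceTracker` and its methods are hypothetical names introduced only for illustration. The sketch assumes that the tracker may be incomplete, so a line it has no record of falls back to a full broadcast, and it does not model evictions or invalidations.

```python
from typing import Dict, Set


class PresenceTracker:
    """Hypothetical record of which nodes may hold a copy of each line."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        # Maps a cache line address to the set of node ids known to hold it.
        self.holders: Dict[int, Set[int]] = {}

    def record_fill(self, line: int, node: int) -> None:
        """Note that `node` has cached `line`."""
        self.holders.setdefault(line, set()).add(node)

    def snoop_targets(self, line: int, requester: int) -> Set[int]:
        """Return the nodes to snoop when `requester` misses on `line`.

        Tracked line: snoop only the recorded holders (minus the requester).
        Untracked line: assume nothing is known and broadcast to all others.
        """
        if line in self.holders:
            return self.holders[line] - {requester}
        return set(range(self.num_nodes)) - {requester}


if __name__ == "__main__":
    tracker = PresenceTracker(num_nodes=4)
    tracker.record_fill(0x100, node=2)
    print(tracker.snoop_targets(0x100, requester=0))  # targeted snoop
    print(tracker.snoop_targets(0x200, requester=0))  # broadcast fallback
```

In this sketch a tracked line costs one snoop instead of three in a four-node system; a real design would also have to keep the record accurate as copies are evicted or invalidated.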