Shared memory multiprocessor systems allow each of multiple processors to reference any storage location (memory) in the system through read and write (load and store) operations. The underlying structure of the shared memory is hidden from the processors, or programs, except insofar as performance is concerned.
A single memory location may be updated by multiple processors. The result is a single sequence of updates, and all processors see the updates to that memory location in the same order. This property is known as "coherence". On a coherent system, no processor can see a different order of updates than another processor.
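By way of illustration only, the coherence property can be expressed as a check that every processor's observed sequence of values for one location is consistent with a single global order of updates. The following is a minimal sketch; the function names, and the assumption that each update writes a distinct value, are hypothetical and are not taken from any described system.

```python
def is_subsequence(observed, global_order):
    """Return True if `observed` appears, in order, within `global_order`."""
    remaining = iter(global_order)
    # `value in remaining` advances the iterator until it finds `value`,
    # so later values must appear after earlier ones.
    return all(value in remaining for value in observed)


def is_coherent(global_order, observations):
    """Check that every processor saw the updates in the same order.

    Assumption: each update writes a distinct value, and a processor may
    miss intermediate values but may never see them out of order.
    """
    return all(is_subsequence(seen, global_order)
               for seen in observations.values())
```

In this sketch, a processor that observes the values 1 and then 3 is consistent with the global order 1, 2, 3, whereas a processor that observes 2 and then 1 is not.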
Cache coherent, shared memory multiprocessor systems add caches to the memory structure in order to improve the performance (reduce the latency) of memory accesses. Because the caches are kept coherent, the single sequence of updates for a given memory location, as seen by all processors in the system, is maintained.
The system architectures discussed in this patent are cache coherent, shared memory multiprocessor systems. Three specific variations of these systems are described below, namely, UMA, NUMA, and S-COMA.
"UMA" refers to Uniform Memory Access, and describes a system architecture wherein multiple processors in a computer system share a real address space, and the memory latency from any processor to any memory location is the same or uniform. That is, a given processor can reference any memory location in uniform time. Most modern symmetric multi-processors (SMP) are UMA systems. FIG. 1 shows a typical UMA system 10 configuration. A number of processors 12 are connected to a common system bus 14, as is a memory 16. Because the path from any processor 12 to any location in memory 16 is the same (i.e., across the system bus), the latency from any processor to any memory location is the same.
FIG. 1 also shows caches 18. There must be a cache coherence protocol which manages caches 18 and ensures that updates to a single memory location are ordered, so that all processors will see the same sequence of updates. In UMA systems, such as the one depicted, this is frequently accomplished by having each cache controller "snoop" on the system bus. This involves observing all transactions on the bus, and taking action (i.e., participating in the coherence protocol) when an operation on the bus refers to a memory location which is being held in the snooper's cache.
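The snooping behavior described above can be sketched as follows. This is a simplified, hypothetical model only: it assumes a write-through, write-invalidate policy, and the class and method names (`Bus`, `SnoopingCache`, `snoop_write`) are illustrative rather than part of any described protocol.

```python
class Bus:
    """A common system bus over which every write is visible to every cache."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, source, address, value):
        # Assumption: write-through, so memory is updated on every write.
        self.memory[address] = value
        # Every other cache controller observes (snoops) the transaction.
        for cache in self.caches:
            if cache is not source:
                cache.snoop_write(address)


class SnoopingCache:
    """A cache whose controller snoops on the bus it is attached to."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}           # dict: address -> cached value
        bus.attach(self)

    def read(self, address):
        if address not in self.lines:                     # miss: go to memory
            self.lines[address] = self.bus.memory.get(address, 0)
        return self.lines[address]

    def write(self, address, value):
        self.lines[address] = value
        self.bus.broadcast_write(self, address, value)

    def snoop_write(self, address):
        # Act only if the snooped address is held in this cache.
        self.lines.pop(address, None)
```

Because a stale copy is invalidated whenever another processor writes the same location, a subsequent read misses and fetches the current value from memory.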
The benefit in this type of organization is that parallel programming is simplified, in that processes can be less sensitive to data placement; i.e., data can be accessed in a particular amount of time, regardless of the memory location used to hold the data.
The drawback of this type of organization is that UMA systems do not scale well. As larger and larger systems are designed (with more and more processors and memory), it becomes increasingly difficult and costly to maintain the uniformity of memory access times. Furthermore, schemes which require cache controllers to snoop require a common communications medium, such as a common system bus, for data addresses. However, the system bus is a serial resource which becomes overloaded as more processors and more memory operations are placed on it. When the system bus is saturated, the addition of more or faster processors does not improve system performance.
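The saturation effect can be illustrated with a simple model in which the bus delivers at most a fixed number of memory operations per unit time. This is a minimal sketch under that assumption; the function name and all numeric values used with it are hypothetical.

```python
def delivered_throughput(processors, per_processor_demand, bus_capacity):
    """Memory operations per unit time actually serviced by a shared bus.

    Assumption: each processor issues `per_processor_demand` operations per
    unit time, and the bus can carry at most `bus_capacity` in total.
    """
    return min(processors * per_processor_demand, bus_capacity)
```

With a hypothetical demand of 10 operations per processor and a bus capacity of 100, throughput grows until ten processors are present; beyond that point, adding processors leaves the delivered throughput unchanged.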
A further system variation is "NUMA", which refers to Non-Uniform Memory Access, and describes a system architecture wherein multiple processors in a computer system share a real address space, and wherein memory latency varies depending on the memory location being accessed. That is, some memory locations are "closer" to some processors than to others. Unlike an UMA system, not all memory locations are accessible from a given processor in equal time; i.e., some memory locations take longer to access than others, hence memory access times are non-uniform.
As shown in FIG. 2, a NUMA system implements distributed shared memory; i.e., the total system memory is the sum of memories M1, M2, M3 in nodes 22. There is a single real address space which is shared by all the nodes 22 in the system 20 and, in FIG. 2, each node contains one third of the system memory. Each node 22 includes an UMA system 10. A number of nodes are connected to a common communications fabric or network 24, each through a Network Interface (NI) 26.
A processor in one node may access a memory location in another node via a load or store instruction. The NUMA Memory Controller (NMC) 28 function is responsible for capturing the memory request on the local node's system bus and forwarding it to the node which contains the target memory location (i.e., the home node). Because the path from a processor to a remote memory location is longer than the path from the same processor to a local memory location, the memory access times are non-uniform.
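The routing of a request to its home node, and the resulting non-uniform latency, can be sketched as follows. The sketch assumes, purely for illustration, that the shared real address space is divided into equal contiguous ranges, one per node; the memory size and the latency figures are hypothetical.

```python
NUM_NODES = 3                  # three nodes, each holding one third of memory
NODE_MEMORY_SIZE = 1 << 30     # hypothetical: 1 GiB of memory per node
LOCAL_LATENCY_NS = 100         # hypothetical local access time
REMOTE_LATENCY_NS = 400        # hypothetical remote access time


def home_node(real_address):
    """Return the node whose memory holds `real_address`.

    Assumption: node k owns the contiguous range
    [k * NODE_MEMORY_SIZE, (k + 1) * NODE_MEMORY_SIZE).
    """
    node = real_address // NODE_MEMORY_SIZE
    if not 0 <= node < NUM_NODES:
        raise ValueError("address outside the shared real address space")
    return node


def access_latency_ns(requesting_node, real_address):
    """Latency seen by a processor on `requesting_node` for one access."""
    if home_node(real_address) == requesting_node:
        return LOCAL_LATENCY_NS
    return REMOTE_LATENCY_NS   # request is forwarded to the home node
```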
As with the UMA system, caches are kept coherent through some protocol. All processors on all nodes will see updates to a single memory location as serialized. However, unlike UMA systems, NUMA systems typically do not broadcast memory operations to all nodes so that all cache controllers can snoop on them. Instead, the home node NMC is responsible for forwarding coherence requests to the remote nodes which would be interested in them. In typical NUMA implementations, each NMC maintains a directory for all of the memory in its node. This directory tracks each cache line of local memory, keeping the state of the line and an awareness of which other nodes are caching the line. For example, NMC 28 in Node 1 tracks all of the memory in M1. If a memory operation occurs on Node 1, NMC 28 on Node 1 consults its directory, and may forward the request to all nodes which have the target line cached. A sample data flow for remote memory access in a NUMA system is well described in U.S. Pat. No. 5,710,907, entitled: "Hybrid NUMA COMA Caching System and Methods for Selecting Between the Caching Modes", the entirety of which is hereby incorporated herein by reference.
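A directory of the kind described above can be sketched as a mapping from each local cache line to the set of nodes caching it. This is a simplified, hypothetical model: it omits the per-line state, and the class and method names are illustrative only.

```python
from collections import defaultdict


class HomeDirectory:
    """Per-node directory tracking which nodes cache each local line."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.sharers = defaultdict(set)   # line address -> set of node ids

    def record_read(self, line, requesting_node):
        """Note that `requesting_node` now holds a copy of `line`."""
        self.sharers[line].add(requesting_node)

    def handle_write(self, line, writing_node):
        """Return the nodes to which the coherence request is forwarded.

        Only nodes that actually cache the line are contacted, rather than
        broadcasting the operation to every node in the system.
        """
        targets = sorted(self.sharers[line] - {writing_node})
        self.sharers[line] = {writing_node}   # writer holds the only copy
        return targets
```

For a line cached by nodes 2 and 3, a write by node 2 is forwarded only to node 3; nodes that never cached the line see no traffic for it.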
The benefit of such an architecture is that it is easier to build a system which scales beyond the limitations of an UMA system. The primary reason for this is that the cache controllers do not all have to snoop on a single, common communications fabric. Instead, they snoop only on a local fabric, which sees memory operations only when they affect that node.
The drawback of a NUMA system is that performance-sensitive programs will perform differently depending on where data is placed in memory. This is particularly critical for parallel programs, which may share data among many threads of execution of the program.
A secondary problem, which exacerbates the increased memory latency of the distributed memory, is the limited cache sizes of NUMA systems. Some NUMA systems do not provide greater caching capabilities than the SMPs out of which they may be built, in which case the increased memory latency lessens the benefit of the hardware caches. Alternatively, another level of hardware caching can be provided, possibly in the NMC. However, this tends to be dedicated hardware, which means an expense which becomes practical only when a large number of accesses to remote memory exists.
As a further system variation, "S-COMA" refers to Simple Cache Only Memory Architecture (i.e., a variant of COMA, the Cache Only Memory Architecture), and describes a distributed shared memory architecture wherein multiple processors in a computer system can transparently access any memory in the complex, and memory latency varies depending on the memory location being accessed. However, unlike NUMA systems, nodes maintain independent real address spaces. A portion of the local real memory of each node is used as a cache, allocated in page sized chunks by system software. The specifics of S-COMA operation are well described in the above-incorporated U.S. Pat. No. 5,710,907.
One benefit of such an architecture is that it is easy to build a system which scales better than UMA or NUMA. The primary reason for this is that each node manages only its local real memory space, reducing system complexity and internode interference. Also, as with NUMA, the cache controllers do not all have to snoop on a single, common communications fabric. Instead, they snoop only on a local fabric, which sees memory operations only when they affect that node.
Additionally, S-COMA provides better average latency than NUMA for many programs by providing very large main memory caches. Because much larger caches are provided, the number of cache misses, and therefore remote memory accesses, can be greatly reduced, thereby improving program performance. In addition, S-COMA provides better scalability and node isolation characteristics than NUMA by reducing contention for memory management data structures. It also limits direct memory access from remote nodes by filtering addresses through a translation function.
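The effect of a larger cache on average latency can be illustrated with a simple weighted average of local and remote access times. This is a minimal sketch; the function name is hypothetical and any latency figures supplied to it are assumptions, not measurements.

```python
def average_latency_ns(local_hit_rate, local_latency_ns, remote_latency_ns):
    """Average memory latency when a fraction of accesses is served locally.

    Assumption: every access is served either from local memory (including
    a main memory cache) or from a remote node, with fixed latencies.
    """
    if not 0.0 <= local_hit_rate <= 1.0:
        raise ValueError("local_hit_rate must be between 0 and 1")
    return (local_hit_rate * local_latency_ns
            + (1.0 - local_hit_rate) * remote_latency_ns)
```

Under this model, raising the fraction of accesses served locally, as a very large main memory cache is intended to do, reduces the average in direct proportion to the gap between remote and local latency.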
Referring to FIG. 3, with S-COMA architecture a global memory object is created and allocated a global address (GA), and a single node is designated the home node 30 for any particular data page. The global object is attached to each interested process' address space by assigning the object a virtual address (VA). This VA is subsequently translated using page tables (PT) into a local real address (RA).
Each node 30, 32 maintains an S-COMA cache 34; i.e., a cache in main memory 36 which is maintained by the S-COMA subsystem. When the global data area is referenced, a slot in the S-COMA cache is allocated, and the data is made memory resident in the home node 30 by placing it in the home node's S-COMA cache 34. The S-COMA apparatus on the home node 30 is primed to associate the home S-COMA cache line 34 address (RA) with the target line's global address (GA). The S-COMA apparatus on the client node 32 is primed to associate the client S-COMA cache line 34 address (RA') with the target line's global address (GA).
When a client 32 attempts to reference global data which is not in the local L2 cache, the client's S-COMA cache 34 is checked. If the data is available there, the data is fetched from local memory (RA') and the request is finished. If the data is not present or is not in a valid state in the client's S-COMA cache 34, then the client's S-COMA directory (dir.') 38 communicates with the home S-COMA directory (dir.) 38 to retrieve a valid copy of the data.
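The lookup order described above can be sketched as follows. This is a hypothetical illustration only: the function name, the dictionary representation of the caches, the `fetch_from_home` callback, and the choice to refill the L2 cache after each lower-level hit are all assumptions.

```python
def client_read(line, l2_cache, scoma_cache, fetch_from_home):
    """Resolve a client reference to a global line.

    Returns (data, source), where source names the level that supplied the
    data: "l2", "scoma", or "home".
    """
    if line in l2_cache:                          # hit in the local L2 cache
        return l2_cache[line], "l2"

    entry = scoma_cache.get(line)
    if entry is not None and entry["valid"]:      # hit in local main memory
        l2_cache[line] = entry["data"]
        return entry["data"], "scoma"

    # Absent or invalid locally: obtain a valid copy from the home node.
    data = fetch_from_home(line)
    scoma_cache[line] = {"data": data, "valid": True}
    l2_cache[line] = data
    return data, "home"
```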
At each node of the internode communication, the S-COMA mechanism performs a boundary function, translating the relevant local real address (RA or RA') of the S-COMA cache slot to a global address (GA). Note that each node 30, 32 can utilize a different real address for the target line's S-COMA cache slot, but all use the same global address for identifying a particular global line. In this way, independence is maintained between nodes.
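The boundary function can be sketched as a pair of per-node lookup tables between local real addresses and global addresses. The class name, method names, and address values below are hypothetical and serve only to illustrate the translation.

```python
class ScomaTranslator:
    """Per-node mapping between local real addresses and global addresses."""

    def __init__(self):
        self.ra_to_ga = {}
        self.ga_to_ra = {}

    def prime(self, real_address, global_address):
        """Associate a local S-COMA cache slot with a global line."""
        self.ra_to_ga[real_address] = global_address
        self.ga_to_ra[global_address] = real_address

    def outbound(self, real_address):
        """Translate a local real address to the global address sent out."""
        return self.ra_to_ga[real_address]

    def inbound(self, global_address):
        """Translate a received global address to this node's real address."""
        return self.ga_to_ra[global_address]


# Each node picks its own real address for the same global line.
home = ScomaTranslator()
client = ScomaTranslator()
home.prime(0x1000, 0xABC000)      # RA on the home node (hypothetical values)
client.prime(0x7000, 0xABC000)    # RA' on the client node
```

Only the global address crosses the node boundary, so neither node needs to know the real address chosen by the other.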
An S-COMA caching system suffers the drawback of allocating cache slots in page increments, although coherence is performed on a cache line basis. If a process uses a large percentage of the memory in each page allocated in the S-COMA cache, S-COMA can provide advantages over NUMA by providing a great deal more caching capacity. However, if relatively little of each allocated page is used, then S-COMA wastes memory by allocating large cache slots which are inefficiently used.
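The trade-off can be quantified as the fraction of each allocated page that a process actually uses. This is a minimal sketch; the function name is hypothetical, and the page and line sizes used with it are assumed values.

```python
def scoma_slot_utilization(page_size, line_size, lines_used_per_page):
    """Fraction of an allocated page-sized cache slot that holds useful data.

    Assumption: slots are allocated in whole pages, while data is used (and
    kept coherent) in units of one cache line.
    """
    lines_per_page = page_size // line_size
    if not 0 <= lines_used_per_page <= lines_per_page:
        raise ValueError("lines_used_per_page exceeds the lines in a page")
    return lines_used_per_page / lines_per_page
```

For example, assuming 4096-byte pages and 128-byte lines, a process that uses only 4 of the 32 lines in each page makes use of one eighth of the memory allocated to it in the S-COMA cache.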
As a further variation, a hybrid caching architecture together with a cache-coherent protocol for a multi-processor computer system is described in the above-incorporated U.S. Letters Patent No. 5,710,907 entitled "Hybrid NUMA COMA Caching System and Methods for Selecting Between the Caching Modes." In one implementation of this hybrid system, each subsystem includes at least one processor, a page-oriented COMA cache and a line-oriented hybrid NUMA/COMA cache. Each subsystem is able to independently store data in COMA mode or in NUMA mode. When caching in COMA mode, a subsystem allocates a page of memory space and then stores the data within the allocated page in its COMA cache. Depending on the implementation, while caching in COMA mode, the subsystem may also store the same data in its hybrid cache for faster access. Conversely, when caching in NUMA mode, the subsystem stores the data, typically a line of data, in its hybrid cache.
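The two storage paths summarized above can be sketched as follows. This sketch does not reproduce the method of the incorporated patent, and it deliberately omits how a mode is selected; the class name, the page size, and the `mirror_in_hybrid` option are hypothetical.

```python
PAGE_SIZE = 4096   # hypothetical page size in bytes


class HybridSubsystem:
    """One subsystem with a page-oriented cache and a line-oriented cache."""

    def __init__(self, mirror_in_hybrid=True):
        self.coma_cache = {}      # page base address -> {offset: data}
        self.hybrid_cache = {}    # line address -> data
        self.mirror_in_hybrid = mirror_in_hybrid

    def cache_line(self, address, data, mode):
        """Store one line of data in the requested caching mode."""
        if mode == "COMA":
            # Allocate (or reuse) the page and store the data within it.
            page = address - address % PAGE_SIZE
            self.coma_cache.setdefault(page, {})[address - page] = data
            if self.mirror_in_hybrid:
                self.hybrid_cache[address] = data   # optional faster copy
        elif mode == "NUMA":
            # No page is allocated; only the line itself is cached.
            self.hybrid_cache[address] = data
        else:
            raise ValueError("mode must be 'COMA' or 'NUMA'")
```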
One drawback to the above-summarized hybrid system is that the system relies on an S-COMA coherence apparatus which is independent of and equal to a NUMA coherence apparatus. Two logically complete apparatuses are employed to implement the hybrid concepts described therein. Furthermore, both home and client nodes must maintain S-COMA caches for data and translate between global addresses and real addresses.
Notwithstanding the existence of the above-summarized system variations, further enhancements to system memory architecture and, in particular, to a hybrid architecture employing a first type of memory and a second type of memory, are desired.