NUMA designs attempt to take advantage of the memory hierarchy, because a process suffers less memory access latency (and runs faster) the closer the processor running the process is to the memory serving that processor. If a memory access requires using a system interconnect, the resulting latency is high, in some cases high enough to be prohibitive. In addition to saving memory access latency, maintaining a process's memory locally and providing local access reduces use of the system interconnect. Reducing use of the system interconnect allows more CPU modules to be connected to a given interconnect, faster processor speeds relative to the interconnect's bandwidth, or both. Therefore, the primary goal of the memory management software used in a NUMA design is to store a process's most frequently used pages as close to the processor running the process as possible. At the same time, the operating system must minimize the time and resources required to satisfy this goal.
Recently, several experimental multiprocessor computer systems have been developed that attempt to reduce memory access latency by distributing system memory. Each of these systems includes multiple processors, each with a local memory, interconnected by either a bus or a butterfly network, together with some amount of global memory. See Parallel Programming Using the MMX Operating System and Its Processor by E. Gabber in Proceedings of the Third Israel Conference on Computer Systems and Software Engineering, Tel-Aviv, Israel, Jun. 6-7, 1988, pp. 122-23; The Advanced Computing Environment Multiprocessor Workstation by A. Garcia, D. J. Foster and R. F. Freitas, IBM Research Report RC 14491 (#64901), IBM T. J. Watson Research Center, March 1989; Butterfly™ Parallel Processor Overview, BBN Report No. 6148, Version 1, Mar. 6, 1986; and The Uniform System Approach to Programming the Butterfly™ Parallel Processor, BBN Report No. 6149, Version 2, Jun. 16, 1986. As best understood, each of these systems requires explicit management of local memory storage by either the compiler or the application programmer. Memory management is not controlled by the operating system.
Apparently, some recent work at the University of Rochester has taken the Mach operating system and modified its memory management to treat local memory as a cache of pages stored in global memory. See An Overview of PLATINUM: A Platform for Investigating Non-Uniform Memory by R. Fowler and A. Cox, University of Rochester Technical Report 262, November 1988; and The Implementation of a Coherent Memory Abstraction On an NUMA Multiprocessor: Experience with PLATINUM by A. Cox and R. Fowler, University of Rochester Technical Report 263, May 1989. Because the University of Rochester approach does not allocate storage directly in local memory, it does not realize some of the interconnect bandwidth savings achieved by the hereinafter-described invention. Moreover, treating local memory as a cache of pages in global memory has a number of disadvantages. First, treating local memory as a cache means that data stored in global memory is replicated in local memory. Thus, local memory does not add to overall system memory. Further, replicating data in system memory creates a high memory overhead, particularly because system memory is stored on a page basis and page sizes are relatively large. Recently, page sizes of 64K bytes have been proposed. In contrast, cache memories store data in blocks of considerably smaller size. A typical cache block of data is 64 bytes. Thus, the "granularity" of the data replicated in system memory is considerably larger than the granularity of data replicated in cache memory. The large granularity leads to other disadvantages. Greater interconnect bandwidth is required to transfer larger data granules than smaller data granules. Coherency problems also increase, because more processors are likely to contend for a larger granule than for a smaller granule on a packet-to-packet basis.
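The granularity comparison above can be made concrete with a short calculation. The following is a minimal sketch that uses only the two sizes given in the text (64K-byte pages and 64-byte cache blocks); the function names and the 100-byte access are hypothetical and serve only as illustration.

```python
# Sizes taken from the text; everything else here is an illustrative assumption.
PAGE_SIZE_BYTES = 64 * 1024   # proposed page size (64K bytes)
CACHE_BLOCK_BYTES = 64        # typical cache block size


def granularity_ratio(page_bytes=PAGE_SIZE_BYTES, block_bytes=CACHE_BLOCK_BYTES):
    """How many cache blocks fit in one page."""
    return page_bytes // block_bytes


def bytes_moved(touched_bytes, granule_bytes):
    """Bytes transferred when touched_bytes contiguous bytes, starting on a
    granule boundary, must be brought in as whole granules."""
    granules = -(-touched_bytes // granule_bytes)  # ceiling division
    return granules * granule_bytes


if __name__ == "__main__":
    print(granularity_ratio())                  # 1024
    print(bytes_moved(100, CACHE_BLOCK_BYTES))  # 128
    print(bytes_moved(100, PAGE_SIZE_BYTES))    # 65536
```

Under these figures a page is 1024 times larger than a cache block, so replicating a page to satisfy a hypothetical 100-byte access moves 65536 bytes over the interconnect where block-sized replication would move 128.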
Recently, a new NUMA design multiprocessor computer system incorporating coupled memory (hereinafter often abbreviated throughout this application as "CM") has been developed. In a CM multiprocessor computer system, physical system memory resides on both CPU modules and memory modules. Regardless of where it is located, all memory appears as one common physical address space and is accessible by all processors in the system. The part of system memory that physically resides on a CPU module is known as a coupled memory (or CM) region. Each processor accesses its own coupled memory (called a local reference) via a private port. Access to the coupled memory on other modules (called a remote reference) is made via the system interconnect. The coupled memory of each CPU module is considered a separate CM region, while the memory of all of the memory-only modules is grouped into one region known as global memory (often abbreviated hereinafter as GM). GM can contain shared data used by more than one processor and/or act as an overflow resource when the CM region of a CPU module is insufficient. Although this invention is not limited to use in a CM model of a NUMA design, this model is used throughout the following description to explain the invention's details.
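The CM model described above (one common physical address space divided into one CM region per CPU module plus a single GM region) can be sketched as follows. This is an illustrative model only, not the implementation of the invention; the class and function names, the region sizes, and the address layout are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

MIB = 1 << 20  # hypothetical unit used for the example layout


@dataclass(frozen=True)
class Region:
    """A contiguous range of the common physical address space."""
    name: str
    start: int
    size: int
    owner_cpu: Optional[int]  # CPU module holding this CM region; None for GM

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.start + self.size


def classify_reference(regions: List[Region], cpu: int, addr: int) -> str:
    """Classify a memory reference made by `cpu` to physical address `addr`."""
    for region in regions:
        if region.contains(addr):
            if region.owner_cpu is None:
                return "global"   # memory-only modules, grouped as GM
            if region.owner_cpu == cpu:
                return "local"    # served through the CPU module's private port
            return "remote"       # another module's CM, via the interconnect
    raise ValueError(f"address {addr:#x} is outside system memory")


def example_layout() -> List[Region]:
    """Assumed layout: two CPU modules with 1 MiB of CM each, then 4 MiB of GM."""
    return [
        Region("CM0", 0 * MIB, 1 * MIB, owner_cpu=0),
        Region("CM1", 1 * MIB, 1 * MIB, owner_cpu=1),
        Region("GM", 2 * MIB, 4 * MIB, owner_cpu=None),
    ]
```

With this assumed layout, `classify_reference(example_layout(), 0, 0x1000)` yields a local reference, the same address referenced from CPU 1 is a remote reference, and any address at or above 2 MiB falls in GM.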
The present invention is directed to managing the CM regions of a coupled memory multiprocessor computer system in a manner that maintains memory latency and system interconnect usage at a low level.