NUMA designs attempt to take advantage of the memory hierarchy because a process suffers less memory access latency (and runs faster) the closer the processor running the process is to the memory serving that processor. If memory access requires using a system interconnect, the resulting latency is high, in some cases high enough to be prohibitive. In addition to memory access latency savings, maintaining a process's memory locally and providing local access reduces use of the system interconnect. Reducing use of the system interconnect allows more processors to be connected to a given interconnect, faster processor speeds relative to the interconnect's bandwidth, or both. Therefore, the primary goal of the memory management software used in a NUMA design is to store a process's most frequently used pages as close to the processor as possible. At the same time, the operating system must minimize the time and resources required to satisfy this goal.
Recently, several experimental multiprocessor computer systems have been developed that attempt to reduce memory access latency by distributing system memory. Each of these systems includes multiple processors, each with a local memory interconnected by either a bus or a butterfly network and some amount of global memory. See Parallel Programming Using the MMX Operating System and Its Processor by E. Gabber in Proceedings of the Third Israel Conference Computer System Software Engineering, Tel-Aviv, Israel, Jun. 6-7, 1988, pp. 122-23; The Advanced Computing Environment Multiprocessor Workstation by A. Garcia, D. J. Foster and R. F. Freitas, IBM Research Report RC 14491 (#64901), IBM T. J. Watson Research Center, March 1989; Butterfly™ Parallel Processor Overview, BBN Report No. 6148, Version 1, Mar. 6, 1986; and The Uniform System Approach to Programming the Butterfly™ Parallel Processor, BBN Report No. 6149, Version 2, Jun. 16, 1986. As best understood, each of these systems requires explicit management of the local memory storage by either the compiler or the application programmer. Memory management is not controlled by the operating system. Apparently, some recent work at the University of Rochester has taken the Mach operating system and modified its memory management to treat local memory as a cache of pages stored in global memory. See An Overview of PLATINUM: A Platform for Investigating Non-Uniform Memory by R. Fowler and A. Cox, University of Rochester Technical Report 262, November 1988; and The Implementation of a Coherent Memory Abstraction On an NUMA Multiprocessor: Experience With PLATINUM by A. Cox and R. Fowler, University of Rochester Technical Report 263, May 1989. Because the University of Rochester approach does not allocate storage directly in local memory, it does not realize some of the interconnect bandwidth savings achieved by the hereinafter-described invention.
Moreover, treating local memory as a cache of pages in global memory has a number of disadvantages. First, treating local memory as a cache means that data stored in global memory is replicated in local memory. Thus, local memory does not add to overall system memory. Further, replicating data in system memory creates a high memory overhead, particularly because system memory is stored on a page basis and page sizes are relatively large. Recently, page sizes of 64K bytes have been proposed. In contrast, cache memories store data in blocks of considerably smaller size. A typical cache block of data is 64 bytes. Thus, the "granularity" of the data replicated in system memory is considerably larger than the granularity of data replicated in cache memory. The large granularity leads to other disadvantages. Greater interconnect bandwidth is required to transfer larger data granules than smaller data granules. Coherency problems also increase, because more processors are likely to contend for the larger granules than for smaller granules on a packet-to-packet basis.
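The granularity gap described above can be made concrete with a short arithmetic sketch. The two sizes are taken from the text; the function name and the assumption that satisfying a request always transfers whole, aligned granules are illustrative only and are not part of the systems described.

```python
PAGE_BYTES = 64 * 1024   # proposed page size cited in the text (64K bytes)
BLOCK_BYTES = 64         # typical cache block size cited in the text

def bytes_transferred(granule_bytes: int, needed_bytes: int) -> int:
    """Bytes moved over the interconnect when only whole granules can be transferred.

    Assumption (hypothetical model): the needed data is contiguous and
    starts on a granule boundary.
    """
    granules = -(-needed_bytes // granule_bytes)  # ceiling division
    return granules * granule_bytes

# How many cache blocks fit in one page at the cited sizes.
blocks_per_page = PAGE_BYTES // BLOCK_BYTES

# Cost of fetching a single 64-byte item at each granularity.
page_cost = bytes_transferred(PAGE_BYTES, 64)
block_cost = bytes_transferred(BLOCK_BYTES, 64)
```

Under these assumptions, a processor that needs only one 64-byte item moves an entire page when replication is done on a page basis, which is the bandwidth penalty the paragraph above refers to.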
Recently, a new NUMA design multiprocessor computer system incorporating coupled memory (sometimes abbreviated hereinafter as CM) has been developed. In a CM system, physical system memory resides on both CPU and memory-only modules. Regardless of where it is located, all memory appears as one common physical address space and is accessible by all processors of the system. The part of system memory that physically resides on a CPU module is known as a coupled memory or CM region. In addition to a CM region, the CPU modules each include a processor. Each processor accesses its own coupled memory region (called a local reference) via a private port. Accesses to the coupled memory regions of other modules and to the memory-only modules (called remote references) are made via the system interconnect. The coupled memory region of each CPU module is considered separate, while the memory of all of the memory-only modules is grouped into one region known as global memory (sometimes abbreviated hereinafter as GM). Each CM region stores data of primary interest to the processor associated with that CM region, i.e., the processor of the same CPU module. This data primarily comprises the data and stack pages of processes assigned to the local processor. GM contains shared data that is used by more than one CPU and/or acts as an overflow resource when the CM region of a CPU module is insufficient. Although this invention is not limited to use in a CM model of a NUMA design, this model is used throughout the following description to explain the invention's details.
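The distinction between local and remote references can be illustrated with a minimal sketch. The `Page` type, its `owner_module` field, and the function name are hypothetical; the text does not specify how a CM system records where a page resides, so this only models the definitions given above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Page:
    # Hypothetical field: the CPU module whose CM region holds this page,
    # or None when the page resides in global memory (GM).
    owner_module: Optional[int]

def classify_reference(cpu_module: int, page: Page) -> str:
    """Return 'local' when the page is in the referencing processor's own
    CM region (private port), otherwise 'remote' (system interconnect)."""
    return "local" if page.owner_module == cpu_module else "remote"

own_cm = classify_reference(0, Page(owner_module=0))
other_cm = classify_reference(0, Page(owner_module=1))
global_mem = classify_reference(0, Page(owner_module=None))
```

Note that a page in another CPU module's CM region and a page in GM are both remote from the referencing processor's point of view, since both are reached over the system interconnect.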
More details of a CM multiprocessor computer system are described in various patent applications filed before or contemporaneously with this application, namely U.S. patent application Ser. No. 07/649,844, entitled "Affinity Scheduling of Processes on Symmetric Multiprocessing Systems," filed Feb. 1, 1991; U.S. patent application Ser. No. 07/673,766, entitled "Coupled Memory Multiprocessor Computer System Including Cache Coherency Management Protocols," filed Mar. 20, 1991; and U.S. patent application Ser. No. 07/673,132, entitled "Memory Management Method for Coupled Memory Multiprocessor Systems," filed Mar. 20, 1991, the subject matter of which is incorporated herein by reference. As described in those applications, one of the goals of CM multiprocessor computer systems is to maintain in the CM region of each CPU module the data most frequently used by the processor of that CPU module. This is done in order to minimize use of the system interconnect and, thus, allow more processors or faster processors to be used with the same capacity interconnect. Unfortunately, not all pages stored in the CM regions and in the GM region are used with equal frequency. Some memory pages are used more frequently than others. A remote data or stack page, i.e., a data or stack page stored in global memory, that is referenced once per second does not require the same interconnect bandwidth nor cause the same access latency delays as does a remote data or stack page referenced 10,000 times per second. Clearly, if there is only room for one more page in the CM region of the CPU module requiring access to these two pages, the choice is to bring in the more frequently referenced page, i.e., the data or stack page referenced 10,000 times per second. Unfortunately, the memory page references of any specific process are impossible to predict exactly.
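The choice described above, between a page referenced once per second and one referenced 10,000 times per second, can be sketched as a simple selection by observed reference rate. This is an illustration of the stated preference only, not the method of the invention; the function name, the page labels, and the idea of a per-page rate table are assumptions introduced for the example.

```python
import heapq

def choose_pages_for_cm(ref_rates: dict[str, float], free_slots: int) -> list[str]:
    """Pick up to free_slots remote pages with the highest observed
    reference rate (references per second).

    Assumption: rates are past measurements; as the text notes, future
    references cannot be predicted exactly.
    """
    return heapq.nlargest(free_slots, ref_rates, key=ref_rates.get)

# The two remote pages from the example in the text (labels are hypothetical).
rates = {"page_a": 1.0, "page_b": 10_000.0}

# Only one free page frame remains in the CM region.
chosen = choose_pages_for_cm(rates, free_slots=1)
```

With a single free slot, the selection returns the page referenced 10,000 times per second, matching the choice stated in the text.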
The present invention is directed to a method of adapting the memory of a coupled memory multiprocessor computer system to the dynamic needs of changing processes.