The present invention relates generally to high performance parallel computer systems and more particularly to dynamic page placement in cache coherent non-uniform memory architecture systems.
Many high performance parallel computer systems are built as a number of nodes interconnected by a general interconnect network (e.g., crossbar and hypercube), where each node contains a subset of the processors and memory in the system. While the memory in the system is distributed, several of these systems (called NUMA systems for Non-Uniform Memory Architecture) support a shared memory abstraction where all the memory in the system appears as a large memory common to all processors in the system.
These systems must decide where to place physical pages within the distributed memory, since only a processor's local memory is close to it. Any memory that is not local to the processor is considered remote memory. Remote memory has a longer access time than local memory, and different remote memories may have different access times. With multiple processors sharing memory pages and a finite size memory local to each processor, some percentage of the physical pages required by each processor will be located within remote physical memory. The chances that a physical page required by a processor is in local memory can be improved by using static page placement of physical memory pages.
Static page placement attempts to locate each physical memory page in the memory that causes the highest percentage of memory accesses to be local. Optimal physical memory page placement reduces the average memory access time and reduces the bandwidth consumed within the processor interconnect between processor nodes where there is uniform memory access time. The static page placement schemes include Don't Care, Single Node, Line Interleaved, Round Robin, First Touch, Optimal, etc., which are well known to those skilled in the art.
Dynamic page placement may be used after the initial static page placement to replicate or migrate the memory page to correct the initial placement or change the location due to changes in the particular application's access patterns to the memory page. The page placement mechanism, which is involved in the decision and copying/movement of the physical pages, may be in the multi-processor's operating system (OS) or in dedicated hardware.
A replication is the copying of a physical page so that two or more processors have a local copy of the page. As long as the memory accesses are reads, multiple copies of data can be allowed without causing coherence difficulties. As soon as a write to the page is sent to the memory system, either all copies of the page must be removed or an update coherence algorithm must be in place to make sure all of the pages have the same data.
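The replication behavior described above can be illustrated with a minimal sketch. It assumes the simpler of the two options named in the text, removing all extra copies on a write, and does not model an update coherence algorithm. The class and method names (`ReplicatedPage`, `replicate`, `read`, `write`) are hypothetical and chosen only for illustration.

```python
class ReplicatedPage:
    """Toy model of one physical page and its read-only replicas (hypothetical)."""

    def __init__(self, home_cell, data):
        self.home_cell = home_cell
        # Maps a cell id to that cell's copy of the page contents.
        self.copies = {home_cell: data}

    def replicate(self, cell):
        # Copy the page so that `cell` can satisfy reads locally.
        self.copies[cell] = self.copies[self.home_cell]

    def read(self, cell):
        # Return (data, was_local): use the local copy if one exists,
        # otherwise fall back to the copy held by the home cell.
        was_local = cell in self.copies
        return self.copies.get(cell, self.copies[self.home_cell]), was_local

    def write(self, cell, data):
        # Assumed invalidate policy: drop every replica and keep a single
        # up-to-date copy at the home cell.
        self.copies = {self.home_cell: data}
```

In this sketch, reads from a cell holding a replica are local until the first write, after which only the home cell holds the page and further reads from other cells become remote again.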
A page migration is the movement of a physical memory page to a new location. The migration is usually permanent and does not require special handling as is required for writes to replicated pages.
An approach to dynamic page placement is described in the paper by Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum, "Operating System Support for Improving Data Locality on CC-NUMA Compute Servers", In ASPLOS VII, Cambridge, Mass., 1996.
In the past, dynamic page placement has been implemented by using a group of history counters for each memory page to keep track of how many accesses are made from each processor. It should be noted that memory accesses from a processor can also be thought of as cache misses to the processor's lowest level of cache. When these counters reach the preset thresholds discussed later, the page placement mechanism is made aware that there is a page in memory that has enough data on how it is accessed for the page placement mechanism to determine the optimal page placement. Once the optimal page placement has been determined, the page in memory can be migrated or replicated to the optimal uniform memory access (UMA) cell. A UMA cell is a grouping of memories which can be accessed by processors in the multi-processor system with the same access latency.
Dynamic page placement increases memory locality and therefore reduces latency and network traffic to improve performance. However, the technique performs best with a small page size. Even for an application like a database with a large data footprint, the standard buffer size is 2K bytes. With 2K byte data structures, there will be 512×1024 data structures in a 1-gigabyte page. With this many data structures on a single page, it is desirable to arrange the structures in such a way as to maximize locality to increase system performance. In the worst case, with poor locality, the fraction of memory accesses that are local will be 1 divided by the number of UMA cells in the distributed shared memory (DSM) system. For a 128-processor system with 2 processors per UMA cell, in the worst case only 1/64th of the memory accesses are local.
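The figures in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below uses only the values stated in the text (1-gigabyte pages, 2K byte structures, 128 processors, 2 processors per UMA cell); the variable names are illustrative.

```python
PAGE_BYTES = 1 << 30        # 1-gigabyte physical page
STRUCT_BYTES = 2 * 1024     # 2K byte data structure (database buffer)

# Number of data structures that fit on a single page.
structs_per_page = PAGE_BYTES // STRUCT_BYTES

PROCESSORS = 128
PROCS_PER_CELL = 2

# Number of UMA cells, and the worst-case fraction of local accesses
# when accesses are spread evenly over all cells.
uma_cells = PROCESSORS // PROCS_PER_CELL
worst_case_local_fraction = 1 / uma_cells
```

Here `structs_per_page` evaluates to 524288, which is 512×1024, and `uma_cells` evaluates to 64, giving the 1/64 worst-case local fraction.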
In addition, with so many data structures located on a single memory page, it is likely that all processors will be accessing each physical page equally. Therefore, static and dynamic page placement techniques will be unable to find a UMA cell to place a page to maximize local memory accesses and therefore improve performance. Further, memory page hotspotting will occur. Hotspotting is the creation of a bandwidth bottleneck due to multiple processors attempting to access the same memory structure at the same time.
Working against small page sizes is the fact that many current processors only contain 96 to 128 entry translation look-aside buffers (TLBs), which are the processor caches that translate virtual to physical addresses and keep track of recently used translations of virtual page numbers. The small size of the TLB requires large pages for good performance when running multiple applications or applications with large data or instruction footprints, for example 1-gigabyte physical pages.
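The pressure toward large pages can be seen from the amount of memory a TLB can map at once, often called its reach: the number of entries multiplied by the page size. The sketch below is only an illustration; the 4K byte small-page size used in the usage note is an assumption and does not appear in the text, while the 96 to 128 entry counts and the 1-gigabyte page size do.

```python
def tlb_reach_bytes(entries, page_bytes):
    """Bytes of memory covered by a fully populated TLB (entries x page size)."""
    return entries * page_bytes


KIB = 1024
GIB = 1 << 30

# Assumed 4K byte small page versus the 1-gigabyte page named in the text.
small_page_reach = tlb_reach_bytes(128, 4 * KIB)
large_page_reach = tlb_reach_bytes(128, 1 * GIB)
```

Under these assumptions a 128-entry TLB covers 512K bytes with 4K byte pages but 128 gigabytes with 1-gigabyte pages, which is why a small TLB favors large pages for applications with large footprints.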
To track the changes in the application's access patterns to the memory page, histories need to be maintained for every page in memory. A set of history counters is located close to the memory system for every physical page in memory and one counter is required for every UMA cell in the multi-processor system. Whenever a memory access is generated from a processor within a UMA cell, the counter representing the page and the cell for the processor performing the access is incremented.
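The per-page, per-cell counting scheme described above can be sketched as follows. This is a simplified software model, not the hardware structure itself. The names `HistoryCounters`, `record_access`, and `busiest_cell` are hypothetical, and the rule of signaling when a single counter reaches one fixed threshold is an assumption standing in for the preset thresholds mentioned earlier.

```python
from collections import defaultdict


class HistoryCounters:
    """One counter per (physical page, UMA cell) pair (hypothetical model)."""

    def __init__(self, num_cells, threshold):
        self.num_cells = num_cells
        self.threshold = threshold  # assumed trigger value, not from the text
        # Each page gets a row of counters, one per UMA cell, created on demand.
        self.counts = defaultdict(lambda: [0] * num_cells)

    def record_access(self, page, cell):
        """Increment the counter for (page, cell).

        Returns True on the access that brings the counter to the threshold,
        which is when the page placement mechanism would be notified.
        """
        row = self.counts[page]
        row[cell] += 1
        return row[cell] == self.threshold

    def busiest_cell(self, page):
        """Cell with the most recorded accesses to `page` (lowest id on ties)."""
        row = self.counts[page]
        return max(range(self.num_cells), key=row.__getitem__)
```

A placement mechanism built on this model might call `busiest_cell` once `record_access` returns True and treat the result as a candidate destination for migration or replication.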
There are two solutions for locating the counters: either within the memory itself or in a separate hardware structure, such as the memory controller or the directory controller. Placing the counters within the memory has the advantage of keeping the cost down by using the existing DRAM in memory, and the number of counters is automatically scaled with the installation of more memory. Unfortunately, this placement has the disadvantage of halving the memory bandwidth because of the accessing and updating of the counters. Placing the counters outside of memory adds a significant amount of hardware to the system because the hardware must be designed for the maximum amount of installable memory and also for the minimum physical page size.
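The cost of placing the counters outside of memory can be estimated from the sizing rule stated above: one counter per UMA cell for every page, with the page count taken at the maximum installable memory and the minimum page size. The function below is a rough sketch; the counter width and the example figures in the usage note are assumptions, not values from the text.

```python
def counter_storage_bytes(max_memory_bytes, min_page_bytes, uma_cells, counter_bits):
    """Storage needed for dedicated history counters.

    One counter per (page, UMA cell), sized for the largest memory
    configuration and the smallest supported page.
    """
    pages = max_memory_bytes // min_page_bytes
    counters = pages * uma_cells
    return counters * counter_bits // 8
```

As a purely hypothetical example, a system with 64 gigabytes of installable memory, 4K byte minimum pages, 64 UMA cells, and 8-bit counters would need `counter_storage_bytes(64 * 2**30, 4096, 64, 8)`, which is 1 gigabyte of counter storage.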
Those skilled in the art currently teach that the future of multiprocessor systems lies in increasing the physical page size to offset the availability of only 96 to 128 TLB entries per processor rather than using a small page size to improve the percentage of local memory accesses. This is due to the long latency of handling TLB misses.
The present invention provides a multiprocessor system, where the latencies to access areas of memory have different values, with the capability of having the operating system use large page sizes while dynamic page placement manipulates subsets of the large pages without affecting the translation look-aside buffers of the processors. A sub-page support structure is inserted between the processor and the network interface to remote memory. On a remote memory access, this structure determines whether a local copy of the data exists and, if it does, changes the remote access to a local access.
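The redirection performed by the sub-page support structure can be modeled in software as a lookup keyed by sub-page. The sketch below is a simplified illustration under stated assumptions: the class name `SubPageSupport`, its methods, the address arithmetic, and the sub-page size in the usage note are all hypothetical and are not taken from the invention's detailed description.

```python
class SubPageSupport:
    """Toy model of a lookup that sits between a processor and remote memory."""

    def __init__(self, subpage_bytes):
        self.subpage_bytes = subpage_bytes
        # Maps a remote sub-page base address to the base address of its local copy.
        self.local_copies = {}

    def install_copy(self, remote_base, local_base):
        # Record that the sub-page at `remote_base` now has a local copy.
        self.local_copies[remote_base] = local_base

    def route(self, remote_addr):
        """Return ("local", addr) if a local copy exists, else ("remote", addr)."""
        offset = remote_addr % self.subpage_bytes
        base = remote_addr - offset
        local_base = self.local_copies.get(base)
        if local_base is None:
            return ("remote", remote_addr)
        return ("local", local_base + offset)
```

With an assumed 4096-byte sub-page, after `install_copy(0x10000, 0x2000)` a call to `route(0x10010)` yields `("local", 0x2010)`, while an address in a sub-page with no local copy is passed through unchanged as a remote access. The processor still issues the original address, so its TLB entry for the large page is never touched.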
The present invention also provides a sub-page support structure having history counters which instructs the processor of a new memory location or passes an access along to the correct UMA cell when a sub-page in a remote memory has been migrated to a third UMA cell.
The present invention further provides a system for dynamic page placement with physical memory sub-pages which consumes less bus bandwidth by moving smaller pages. Currently, those skilled in the art teach that system improvements can be best achieved by increasing physical page size.
The present invention further provides a system for dynamic page placement with physical memory sub-pages which alleviates false sharing within a memory page between processors.
The present invention further provides a system for dynamic page placement with physical memory sub-pages which alleviates the problem of having too few translation look-aside buffer entries.
The present invention further provides a system for dynamic page placement with physical memory sub-pages which removes the negative aspects of large page sizes.
The present invention further provides a system for dynamic page placement with physical memory sub-pages which does not prevent the entire page from being replicated or migrated.
The present invention further provides a system for dynamic page placement with physical memory sub-pages which eliminates updating of the translation look-aside buffer entries for sub-page migration and replication.
The above and additional advantages of the present invention will become apparent to those skilled in the art from a reading of the following detailed description when taken in conjunction with the accompanying drawings.