1. Field of the Invention
The present invention relates generally to a system and method for cache coherency, and more particularly to a system and method for maintaining coherency of virtual-to-physical memory translations in a multiprocessor computer.
2. Related Art
In most distributed, shared-memory multiprocessors, the coherence of virtual translation caches (such as translations stored in translation-lookaside buffers (TLBs)) is maintained in software rather than in hardware. There are several reasons for this.
Translation updates (i.e., TLB coherence actions) are much less frequent than updates to memory locations (normal memory coherence actions). These translation updates are often associated with paging, which also involves a relatively expensive read from or write to a disk, or both. This disk access dominates the cost of maintaining coherence. Most memory pages are private to individual processes and processors, so TLB updates can often be done by simply purging a translation entry in a single processor's TLB. In many architectures (including MIPS), translation table updates, which comprise loads and purges, are handled in software anyway, so it is more natural to handle TLB coherence in software as well.
It should also be noted that much of the cost of software TLB coherence in a multiprocessor stems from the need to synchronize (i.e., interrupt) all the processors that need to have entries in their TLBs invalidated or updated.
In a large-scale non-uniform memory architecture (NUMA), many of these conditions do not hold. A NUMA computer system typically includes a plurality of processing nodes, each having one or more processors, a cache connected to each processor, and a portion of main memory that can be accessed by any of the processors. The main memory is thus physically distributed among the processing nodes. At any time, data elements stored in a particular main memory portion can also be stored in any of the caches existing in any of the processing nodes.
Most importantly, the NUMA architecture implies that translations may change because data is migrated from the memory of one node to the memory of a node containing the processor that references the data more frequently. This can cause the rate of translation updates to increase relative to traditional systems. Further, the costs other than the TLB update itself decrease, since the data moves only from one memory to another, not to disk. Thus, in a NUMA system it is desirable to have hardware acceleration of TLB coherence. Furthermore, since most processors (including all the MIPS processors) do not support TLB coherence in hardware, it is desirable for TLB coherence to be managed outside of the processor, and to remove the need for inter-processor synchronization in the updating of the TLBs.
A number of schemes for maintaining TLB coherence have been described in the literature (see Teller et al., "Translation-Lookaside Buffer Consistency", IEEE Computer, June 1990). Teller's TLB validation algorithm maintains a generation count in memory, which is incremented when a page table translation update is made. Along with each memory access, the processor includes its TLB copy of the generation count, and the memory compares the two counts. If the two match, then the translation is valid and the access is validated. If the generation counts do not match, then the processor is notified to invalidate the appropriate TLB entry. An advantage of Teller's scheme is that it does not have problems reclaiming stale pages. When a page is reused, the new translation starts with the next generation count for that physical page, or frame (a frame being a page-sized portion of main memory). The scheme does, however, require translations to be purged when a given generation counter overflows.
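The validation step described above can be illustrated with a minimal sketch. It is not taken from Teller et al. or from the present disclosure; the class names Memory and Processor, the per-frame dictionary of counts, and the returned strings are assumptions made purely for illustration, and counter overflow is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    # Hypothetical layout: one generation count per physical frame.
    generation: dict = field(default_factory=dict)

    def update_translation(self, frame):
        """Increment the frame's generation count when its translation changes."""
        self.generation[frame] = self.generation.get(frame, 0) + 1
        return self.generation[frame]

    def access(self, frame, tlb_generation):
        """Return True if the count sent with the access matches memory's count."""
        return self.generation.get(frame, 0) == tlb_generation


@dataclass
class Processor:
    # Hypothetical TLB: virtual page -> (frame, generation count copy).
    tlb: dict = field(default_factory=dict)

    def load_translation(self, vpage, frame, memory):
        """Cache a translation together with the frame's current count."""
        self.tlb[vpage] = (frame, memory.generation.get(frame, 0))

    def reference(self, vpage, memory):
        """Send the cached count with the access; drop the entry on a mismatch."""
        frame, gen = self.tlb[vpage]
        if memory.access(frame, gen):
            return "valid"
        del self.tlb[vpage]  # processor is notified to invalidate the entry
        return "invalidated"


mem = Memory()
cpu = Processor()
cpu.load_translation(0x10, 7, mem)
first = cpu.reference(0x10, mem)   # counts match, access is validated
mem.update_translation(7)          # translation for frame 7 changes
second = cpu.reference(0x10, mem)  # counts differ, TLB entry is dropped
```

In this sketch a stale entry is detected lazily, at the time of the next access, so no other processor needs to be interrupted when the translation is updated.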
A bus error scheme is used in the Wisconsin Wind Tunnel (WWT) (Reinhardt et al., "The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers", ACM SIGMETRICS Conference Proceedings, May 1993) for triggering memory and coherence operations in software. In the WWT design, the error correction code (ECC) for a given memory word is corrupted to cause a bus error, which subsequently invokes software that maintains the illusion of coherent shared memory on top of a message-passing system (i.e., a CM-5 computer: "Connection Machine CM-5: Technical Summary", Thinking Machines Corporation, November 1993). This scheme, however, is directed to cache coherency and does not address the problem of maintaining the coherency of virtual-to-physical memory translations.
The problem not addressed by the prior art is how to minimize costly synchronization of the processors. Thus, what is required is an improved mechanism for maintaining the coherency of virtual-to-physical memory translations in a distributed computer system that results in minimal, if any, system performance degradation, and that requires minimal, if any, additional storage overhead.