1. Field of the Invention
The present invention relates to translational look-aside buffers that are used for storing virtual memory address-to-physical memory address translations for a processor.
2. Related Art
Computer systems, including uni-processor computer systems and multiprocessor computer systems, typically run multiple processes or threads at a time. Each process requires some amount of physical memory. Often, physical memory is limited and must be allocated among the different processes.
In order to allocate limited physical memory among multiple processes, computer systems employ virtual memory schemes. Virtual memory schemes divide physical memory into pages and allocate the pages to the different processes. Physical memory that is so allocated is referred to as mapped memory.
In a virtual memory scheme, each process that is allocated a block of physical memory is also provided with a set of translations for translating virtual addresses to assigned physical addresses of the allocated block. Each set of translations can be stored in, for example, a page table. A page table can be associated with a specific user or shared by multiple users. Alternatively, reverse page table techniques can be employed.
Page tables are commonly indexed by virtual page numbers and include a page table entry (PTE) for each virtual page address. If a virtual page is stored in memory, a corresponding PTE includes a physical address of the page and control information such as a valid bit, permission bits, etc. The PTE for a page can be found by using the virtual page number as an index into the page table. Page tables can be implemented as sparse arrays and are typically stored in main memory.
When a process requests access to a virtual memory address, a page table that is associated with the process is searched for the requested virtual address. When the virtual address is found, the process can access the desired page using the physical address in the PTE that is associated with the virtual address.
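The indexing described above can be sketched as a small software model. The 4 KB page size, sixteen-entry table, and names below (`struct pte`, `translate`) are illustrative assumptions, not features of any particular system:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KB pages: the low 12 bits of an address are the page offset,
 * and the remaining bits are the virtual page number (VPN). */
#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define NUM_PAGES  16   /* illustrative page-table size */

/* A page table entry: physical frame number plus control bits. */
struct pte {
    uint32_t pfn;          /* physical frame number */
    unsigned valid : 1;    /* is this virtual page mapped? */
    unsigned writable : 1; /* example permission bit */
};

static struct pte page_table[NUM_PAGES];

/* Translate a virtual address; returns 1 and fills *paddr on success,
 * 0 if the page is not mapped (a page fault in a real system). */
static int translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return 0;
    *paddr = (page_table[vpn].pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    return 1;
}
```

For instance, mapping virtual page 2 to physical frame 7 makes virtual address 0x2abc translate to physical address 0x7abc, while an access to an unmapped page fails the valid-bit check.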
Computer systems typically employ one or more levels of cache memory between main memory and each processor in order to reduce memory access time. Cache memories store data that is retrieved from main memory. Data that is retrieved by a processor must pass through the one or more levels of cache in order to get to the processor. Because caches tend to be small and physically close to the processor, sometimes located on-chip with the processor, cached data can generally be accessed much faster than data that is stored in main memory. Thus, caches are typically used to store data that needs to be repeatedly accessed by a processor, such as PTE translations.
In addition to caching translations, most processors employ an on-chip translational look-aside buffer (TLB) for storing a number of the most recently used virtual memory address-to-physical memory address translations. When a processor retrieves a translation from main memory or from cache, it stores the translation in an associated TLB. The processor can retrieve a translation from the TLB faster than from the cache or from main memory. Because TLBs tend to be small, storing, for example, forty-eight translations, processors typically overwrite the least recently used translations in a TLB with more recently used translations.
When the processor needs a translation, it first looks to the TLB. If a translation exists in the TLB, the processor retrieves the physical address from the TLB and accesses the data using the physical address. If the translation does not exist in the TLB (i.e., a TLB "miss"), the processor looks to the cache or main memory. These operations can be performed with hardware, software, firmware, or any combination thereof.
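The hit/miss sequence can be modeled with a small, fully associative TLB searched linearly. The four-entry size and round-robin replacement below are purely illustrative stand-ins for a real TLB's larger capacity and least-recently-used policy:

```c
#include <assert.h>
#include <stdint.h>

#define TLB_SIZE 4  /* tiny illustrative TLB */

struct tlb_entry {
    uint32_t vpn;  /* virtual page number */
    uint32_t pfn;  /* physical frame number */
    int valid;
};

static struct tlb_entry tlb[TLB_SIZE];
static int next_victim;  /* round-robin replacement, standing in for LRU */

/* Look up a VPN in the TLB; returns 1 on a hit and fills *pfn. */
static int tlb_lookup(uint32_t vpn, uint32_t *pfn)
{
    for (int i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return 1;
        }
    }
    return 0;  /* a TLB miss: consult the cache or page table */
}

/* After a miss, install the translation fetched from the page table
 * (or cache), overwriting an older entry. */
static void tlb_fill(uint32_t vpn, uint32_t pfn)
{
    tlb[next_victim] = (struct tlb_entry){ .vpn = vpn, .pfn = pfn, .valid = 1 };
    next_victim = (next_victim + 1) % TLB_SIZE;
}
```

A first lookup for a given page misses; after `tlb_fill` installs the translation, subsequent lookups for that page hit without touching the cache or main memory.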
One problem that confronts both TLBs and caches is maintaining consistency of data that is stored in more than one location. For example, a PTE can be retrieved from main memory and stored in both a processor cache and a TLB. Later, an operating system can change or invalidate the PTE. For example, in a distributed shared memory (DSM) system, data that is stored in the mapped physical memory location can be migrated or moved to another physical location. As a result of the migration, the PTE is no longer valid because it stores the physical address of the data prior to the migration. Thus, the operating system updates the PTE to reflect the new physical location of the data. The copy of the PTE that is stored in the processor cache and the TLB, however, is no longer valid.
In order to prevent processors from retrieving invalid translations from caches and TLBs, invalid translations must be flagged with an invalid bit or have a valid bit turned off. Alternatively, invalid translations can be updated with current, valid translations.
Systems and methods for maintaining cache consistency, including hardware and software methods, are well known. Hardware cache consistency schemes, or protocols, are divided into two main types: directory protocols and snooping protocols. In directory protocols, the sharing status of memory is maintained in a centralized directory. In a distributed shared memory (DSM) system, the directory can be distributed. A centralized controller is responsible for maintaining consistency of shared data. Any time that data stored in a memory location is changed, the centralized directory is checked to determine whether a copy of the data is stored in a cache. If so, each copy is either updated or invalidated. For example, copies of a PTE stored in one or more caches could be invalidated by sending an invalidation signal and a page table entry address directly to each cache that stores a copy of the PTE. Directory protocols are highly scalable and are preferred in large multi-processing systems.
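A minimal sketch of the directory idea follows, assuming a per-line bitmask of sharing caches and a hypothetical `send_invalidate()` point-to-point message primitive; only the caches recorded as sharers receive an invalidation:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CACHES 8  /* illustrative system size */

/* Per-line directory state: one bit per cache that holds a copy. */
struct dir_entry {
    uint8_t sharers;  /* bit i set => cache i has a copy */
};

/* Counts messages per cache, so the sketch is observable. */
static int invalidations_sent[NUM_CACHES];

/* Hypothetical point-to-point invalidation message to one cache. */
static void send_invalidate(int cache_id, uint32_t pte_addr)
{
    (void)pte_addr;  /* a real message would carry the PTE address */
    invalidations_sent[cache_id]++;
}

/* When the PTE at pte_addr changes, consult the directory and
 * invalidate only the caches that actually hold a copy. */
static void dir_invalidate(struct dir_entry *e, uint32_t pte_addr)
{
    for (int i = 0; i < NUM_CACHES; i++)
        if (e->sharers & (1u << i))
            send_invalidate(i, pte_addr);
    e->sharers = 0;  /* no cache holds the line any longer */
}
```

Because messages go only to recorded sharers rather than being broadcast, this structure is what makes directory protocols scale to large systems.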
In snooping protocols, no central directory is maintained. Instead, each cache is responsible for the consistency of its data. Actions involving shared memory are broadcast to all caches. Each cache includes snooping hardware for snooping the broadcasts and for determining whether a broadcast requires that some action be taken on the cache. For example, a PTE stored in a cache could be invalidated by broadcasting an invalidation signal and a PTE address. The address snooping hardware for each cache receives the broadcast and searches the associated cache for the PTE address. If the PTE address is found, the associated translation is invalidated or updated. Because of the broadcast requirement, snooping protocols are generally implemented in shared-bus architectures. Since the number of broadcasts generally increases with the number of caches, snooping protocols are not very scalable.
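The snooping behavior can be sketched in the same style: each cache's snoop logic searches for the broadcast address and clears the valid bit on a match. The cache and line counts below are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CACHES      3  /* illustrative */
#define LINES_PER_CACHE 4

struct cache_line {
    uint32_t addr;
    int valid;
};

static struct cache_line caches[NUM_CACHES][LINES_PER_CACHE];

/* Snooping hardware for one cache: search for the broadcast address
 * and invalidate the matching line, if any. */
static void snoop(int cache_id, uint32_t addr)
{
    for (int i = 0; i < LINES_PER_CACHE; i++)
        if (caches[cache_id][i].valid && caches[cache_id][i].addr == addr)
            caches[cache_id][i].valid = 0;
}

/* An invalidation is broadcast on the shared bus; every cache snoops it,
 * whether or not it holds a copy. */
static void broadcast_invalidate(uint32_t addr)
{
    for (int c = 0; c < NUM_CACHES; c++)
        snoop(c, addr);
}
```

Note that every cache does work on every broadcast, which is the source of the scalability limit described above.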
Software cache coherence schemes rely on both operating system and application programmers to ensure consistency. Programmers cannot always predict process execution and interaction, however, which limits the effectiveness of software techniques.
Additional details of cache consistency schemes are provided in, for example: Schimmel, UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers, pp. 287-333, Addison Wesley, 1994; Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2d Ed., pp. 655-693, Morgan Kaufmann, 1996; Tomasevic and Milutinovic, The Cache Coherence Problem in Shared-Memory Multiprocessors: Hardware Solutions, IEEE Computer Society Press, 1993; and The Cache Coherence Problem in Shared-Memory Multiprocessors: Software Solutions, edited by Milo Tomasevic and Veljko Milutinovic, IEEE Computer Society Press, 1993, each of which is incorporated by reference herein.
In TLB consistency schemes, stale TLB entries are typically removed by broadcasting invalidation requests to all central processing units (CPUs) in a system via inter-CPU interrupts and then waiting for acknowledgments to come back from each CPU. This is almost universally performed in software; few, if any, systems use hardware. Interrupting the operating system or the CPU each time that a TLB entry needs to be invalidated, however, reduces the time that can be spent performing other tasks.
Alternatively, invalidation requests can be sent only to CPUs where the process has executed. In either situation, substantial communication and synchronization overhead is incurred. Moreover, operating systems can invalidate every entry in a TLB rather than just the stale entries. This results in additional TLB miss overhead for the entries that were valid.
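The software shootdown described above can be modeled as follows. Interrupt delivery is simulated by a direct function call, all names are hypothetical, and the handler flushes the entire local TLB, illustrating how valid entries are discarded along with stale ones:

```c
#include <assert.h>

#define NUM_CPUS 4  /* illustrative */

static int tlb_flushed[NUM_CPUS];
static int acks;

/* Hypothetical interrupt handler run on each remote CPU: flush the
 * entire local TLB, then acknowledge. Flushing everything is simple
 * but also discards translations that were still valid, causing
 * extra TLB miss overhead later. */
static void shootdown_ipi_handler(int cpu)
{
    tlb_flushed[cpu] = 1;
    acks++;  /* in a real system this would be an atomic operation */
}

/* The initiating CPU interrupts every other CPU and waits until all
 * acknowledgments arrive before it may reuse the stale mapping. */
static void tlb_shootdown(int initiator)
{
    acks = 0;
    tlb_flushed[initiator] = 1;  /* flush locally as well */
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        if (cpu != initiator)
            shootdown_ipi_handler(cpu);  /* models sending an IPI */
    while (acks < NUM_CPUS - 1)
        ;  /* spin waiting for acknowledgments */
}
```

The interrupt round-trip and the spin-wait in `tlb_shootdown` correspond to the communication and synchronization overhead that the remainder of this specification seeks to eliminate.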
What is needed is a hardware-based system and method for updating a translational look-aside buffer when a PTE in a cache is updated, that does not interrupt CPU processing or the operating system and does not invalidate valid TLB entries. A system and method for updating TLBs is needed which reduces communication and synchronization overhead.