As increasingly large and complex software applications are developed for personal computers, a corresponding increase in computer performance is required to run these applications. Therefore, a principal area of research in the computer industry involves ways to increase computer system performance.
A computer system essentially comprises a microprocessor, computer memory, and various peripherals that are coupled to one or more common buses. Put simply, the memory stores program instructions and data, the microprocessor executes these program instructions to manipulate data or perform other operations, and the peripherals are used to display data and interact with the human user. While efforts are underway to increase the performance of virtually every aspect of the computer system, a primary area of research involves increasing the performance of the computer's microprocessor.
A brief review of the evolution of the Intel Corporation (Intel) 80X86 family of microprocessors is appropriate here. In 1981, International Business Machines Corp. (IBM) introduced its personal computer (PC). The IBM PC was built around the Intel 8088, a variant of the 8086 that retained the 8086's 16 bit internal data path and 20 address pins but used an 8 bit external data bus. Rather than incorporating 20 bit registers, the 8086 architecture used a segmented addressing scheme with a 16 bit segment register and a 16 bit offset register. To generate a 20 bit address, the 16 bit segment and offset registers were loaded, the value in the segment register was shifted left four bit positions, and the offset was added to this value to produce the 20 bit address.
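The shift-and-add calculation described above can be sketched as follows. This is a minimal illustration rather than a description of the actual hardware; the function name real_mode_address is hypothetical, and the wrap to 20 bits assumes that any carry beyond the 20 address pins is simply discarded.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Form a 20 bit address from 16 bit segment and offset values."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Shift the segment left four bit positions and add the offset.
    # Assumption: a carry out of bit 19 is dropped, since only 20
    # address pins are available.
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(real_mode_address(0x1234, 0x0010)))  # prints 0x12350
```

Because the segment is shifted by only four bits, many different segment and offset pairs map to the same 20 bit address.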
A later generation Intel microprocessor, the 80286 processor, included two modes of operation referred to as real mode and protected mode. In real mode, the 80286 emulated operation of the 8086 processor. Protected mode offered an entirely new segmentation scheme which allowed for the implementation of virtual memory, the use of privilege levels for memory protection, and a mechanism for separating memory assigned to different tasks in a multi-tasking environment. However, existing DOS applications could not be run in protected mode but rather were required to be run in real mode where they were still limited to one Mbyte of address space. Therefore, the next generation Intel processor, the 80386, introduced a new mode of operation referred to as V86 mode as well as a paging mechanism that could be used in addition to memory segmentation. The paging mechanism allowed DOS applications running in V86 mode to access extended memory, i.e., memory over one Mbyte.
To summarize, the Intel 80386 and later generation processors, the i486 and Pentium, include an on-chip memory management unit (MMU) which includes both segmentation and paging mechanisms. The address translation performed in the MMU allows implementation of virtual memory as well as various memory protection and separation features. For a more complete understanding of the problems solved by the present invention, a brief discussion of the MMU's operation follows. For more information on the operation of the MMU, see the Intel Microprocessors Handbook, Vol. 1, 1993 edition, published by Intel Corporation.
MMU Address Translation
FIG. 1 illustrates the address translation performed by the MMU in protected mode. When an instruction requests the contents of a memory location, the instruction refers to the location not by an actual hardware or physical memory address, but by a virtual or logical address. The logical address must be translated into the appropriate physical memory address to access the desired location. As shown, the segmentation unit in the MMU translates the logical address into a linear address. If paging is not enabled, the linear address then becomes the physical address that is output from the processor to access the requested memory location, as shown. If paging is enabled, the paging mechanism further translates the linear address into a physical address which is then used to access the requested memory location.
1. Segmentation Unit
Referring now to FIG. 2, a more detailed illustration of the address translation that occurs in the MMU is shown. In protected mode, each block or segment of memory is described by a special structure called a segment descriptor. Segment descriptors reside in a set of system tables called descriptor tables. The CPU loads values referred to as a selector and an offset into its segment and offset registers, respectively, and these values are used to access an address in a desired memory segment. In essence, the selector is a 16 bit value that serves as the virtual name for a memory segment, and the MMU uses the selector to index into the descriptor tables and locate the segment descriptor corresponding to the desired memory segment.
As shown in FIG. 3, a descriptor is a small (64 bit) block of memory that describes the characteristics of a much larger memory block or memory segment. The descriptor includes information regarding the segment's base address, its length or limit, its type, its privilege level and various status information. The segment's base address is the starting point of the segment in the linear address space. As shown in FIG. 2, the offset portion of the logical address is added to the base address in the descriptor to generate the linear address of the desired location in the memory segment. Among the status bits, a bit referred to as the Accessed bit is automatically set by the CPU whenever a memory reference is made to the segment defined by the respective descriptor.
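The selector lookup and base-plus-offset addition described above can be sketched as follows. This is a simplified model under stated assumptions: the SegmentDescriptor class and the logical_to_linear function are hypothetical names, the descriptor table is modeled as a dictionary, only the base, limit and Accessed fields are represented, and the table index is assumed to occupy the upper 13 bits of the 16 bit selector.

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    base: int               # starting linear address of the segment
    limit: int              # highest valid offset within the segment
    accessed: bool = False  # the Accessed status bit


def logical_to_linear(descriptor_table, selector, offset):
    """Translate a (selector, offset) logical address to a linear address."""
    # Assumption: the upper 13 bits of the selector index the table;
    # the low three bits carry other information and are ignored here.
    descriptor = descriptor_table[selector >> 3]
    if offset > descriptor.limit:
        raise ValueError("offset exceeds the segment limit")
    # The Accessed bit is set when the segment is referenced.
    descriptor.accessed = True
    # Linear address = segment base address + offset.
    return (descriptor.base + offset) & 0xFFFFFFFF


table = {1: SegmentDescriptor(base=0x00100000, limit=0xFFFF)}
print(hex(logical_to_linear(table, 0x0008, 0x1234)))  # prints 0x101234
```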
The Intel X86 family of processors also includes segment descriptor cache registers for each of its segment registers. Whenever a segment register's contents are changed, the 8-byte descriptor associated with that selector is automatically loaded (cached) in the respective segment descriptor cache register. This is referred to as a segment descriptor reload. Once loaded, all references to that segment use the cached descriptor information instead of reaccessing the descriptor from main memory.
When a memory access occurs and the desired descriptor does not reside in a segment descriptor cache register, the CPU is required to retrieve the descriptor from main memory. The CPU must also perform a locked read/write cycle to main memory to set the Accessed bit in the descriptor. Therefore, three cycles (two reads and one write) are required for every segment descriptor reload. This requirement reduces computer system performance, especially if the desired segment descriptor is not cached in the microprocessor cache and hence these three cycles must propagate to main memory. Therefore, a method and apparatus are desired to reduce the number of cycles required for segment descriptor reloads and hence increase computer performance.
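The cost of a segment descriptor reload can be illustrated with a toy model that counts cycles. The sketch below is not a description of the actual circuitry; the SegmentRegister class, its method names and the dictionary standing in for main memory are all hypothetical.

```python
class SegmentRegister:
    """Toy model of one segment register and its descriptor cache register."""

    def __init__(self, descriptor_table):
        self.descriptor_table = descriptor_table  # stands in for main memory
        self.selector = None
        self.cached_descriptor = None
        self.bus_cycles = 0

    def load(self, selector):
        """Change the register contents, forcing a segment descriptor reload."""
        self.selector = selector
        # One read cycle to fetch the descriptor.
        descriptor = dict(self.descriptor_table[selector])
        self.bus_cycles += 1
        # One locked read/write pair to set the Accessed bit in memory.
        self.descriptor_table[selector]["accessed"] = True
        self.bus_cycles += 2
        descriptor["accessed"] = True
        self.cached_descriptor = descriptor

    def linear_address(self, offset):
        """Later references use the cached copy and add no bus cycles."""
        return self.cached_descriptor["base"] + offset


memory_table = {0x08: {"base": 0x100000, "accessed": False}}
ds = SegmentRegister(memory_table)
ds.load(0x08)
ds.linear_address(0x10)
ds.linear_address(0x20)
print(ds.bus_cycles)  # prints 3
```

In this model the three cycles are paid once per reload, so their impact depends on how often the segment registers are changed.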
2. Paging Mechanism
Referring again to FIG. 2, once the segmentation unit has translated the logical address into a linear address, the linear address is provided to the paging mechanism to be translated into a physical address, assuming paging is enabled. Referring now to FIG. 4, the CPU uses two levels of tables to translate the linear address (from the segmentation unit) into a physical address, these being the page directory and the page tables. The CPU also includes an internal register referred to as control register 3 (CR3) which contains the physical starting address of the page directory. As shown in FIG. 4, the linear address produced by the segmentation unit includes a directory field which stores an index to the page directory. The directory value in the linear address is combined with the page directory base address in CR3 to index to the desired entry in the page directory.
Referring now to FIG. 5, each page directory entry contains the base address of a respective page table as well as information about the respective page table. As shown in FIG. 4, the page table base address stored in the respective page directory entry (FIG. 5) is combined with a page table index value stored in bits 12-21 of the linear address to index to the proper page table entry.
As shown in FIG. 6, a page table entry contains the starting or base address of the page frame being accessed as well as statistical information about the page. As shown in FIG. 4, the page frame base address in the page table entry is concatenated with the lower 12 bits of the linear address, referred to as the offset, to form the physical address. The physical address is output from the pins of the CPU to access the desired memory location.
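The two-level lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: memory is modeled as a dictionary keyed by the physical address of each entry, each entry is assumed to be four bytes wide, the directory field is taken to be the upper 10 bits of the linear address, and the Present bit is not checked. The names split_linear and translate are hypothetical.

```python
PAGE_MASK = 0xFFFFF000  # upper 20 bits hold a page-aligned base address


def split_linear(linear):
    """Split a 32 bit linear address into its three fields."""
    directory_index = (linear >> 22) & 0x3FF  # bits 22-31
    table_index = (linear >> 12) & 0x3FF      # bits 12-21
    offset = linear & 0xFFF                   # bits 0-11
    return directory_index, table_index, offset


def translate(memory, cr3, linear):
    """Walk the page directory and page table to form a physical address."""
    directory_index, table_index, offset = split_linear(linear)
    # CR3 supplies the page directory base; entries are assumed 4 bytes wide.
    directory_entry = memory[(cr3 & PAGE_MASK) + 4 * directory_index]
    # The directory entry supplies the page table base.
    table_entry = memory[(directory_entry & PAGE_MASK) + 4 * table_index]
    # The page frame base is concatenated with the 12 bit offset.
    return (table_entry & PAGE_MASK) | offset


memory = {
    0x1000 + 4 * 1: 0x2001,  # directory entry 1 -> page table at 0x2000
    0x2000 + 4 * 3: 0x5001,  # table entry 3 -> page frame at 0x5000
}
print(hex(translate(memory, 0x1000, 0x00403ABC)))  # prints 0x5abc
```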
a. Page Directory/Table Entries
Referring again to FIGS. 5 and 6, the lower 12 bits of each page table entry and page directory entry contain statistical information about pages and page tables respectively. The P or Present bit, bit 0, indicates whether a page directory or page table entry can be used in address translation. The A or Accessed bit, bit 5, is set by the processor for both types of entries before a read or write access occurs to an address covered by an entry. For a page table entry, the D or Dirty bit, bit 6, is set to 1 before a write to an address covered by that page table entry occurs. The D bit indicates that an address in a page has been updated with new data and is typically used by the operating system to decide whether a page must be written back when it is swapped out. The D bit is undefined for page directory entries. When the A and D bits are updated by the microprocessor, the processor generates a read-modify-write cycle which locks the bus to prevent conflicts with other processors or peripherals.
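The bit positions given above can be expressed as masks. The following sketch is illustrative only; the function names is_present and mark_access are hypothetical, and the locked bus cycle that accompanies a real update is not modeled.

```python
PRESENT = 1 << 0   # P bit, bit 0
ACCESSED = 1 << 5  # A bit, bit 5
DIRTY = 1 << 6     # D bit, bit 6 (page table entries only)


def is_present(entry):
    """Return True if the entry can be used in address translation."""
    return bool(entry & PRESENT)


def mark_access(entry, is_write, is_page_table_entry):
    """Return the entry with the status bits set for one memory access."""
    updated = entry | ACCESSED
    # The Dirty bit is set only for writes, and only in page table entries.
    if is_write and is_page_table_entry:
        updated |= DIRTY
    return updated


print(hex(mark_access(0x5001, is_write=True, is_page_table_entry=True)))  # prints 0x5061
```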
b. Translation Lookaside Buffer
The performance of the paging mechanism would degrade substantially if the processor were required to access two levels of tables for every memory access. To solve this problem and increase performance, the MMU paging mechanism utilizes an internal cache memory referred to as the Translation Lookaside Buffer (TLB) which stores the most recently accessed page table entries. The TLB is a four-way set associative cache, meaning that the cache includes four banks of memory where a page table entry can be stored. The TLB also includes some form of a least recently used (LRU) replacement algorithm for adding new page table entries if the TLB is currently full. The least recently used entry is replaced by a new entry because statistically the LRU entry is the least likely to be requested in the future. Therefore, the TLB automatically keeps the most commonly used page table entries stored in the processor.
When the MMU requests a page table entry and the entry resides in the TLB cache, then a TLB hit occurs, and the entry is retrieved from the TLB without requiring a bus cycle or table lookups. However, if the requested entry does not reside in the TLB cache, then the requested entry is retrieved from the page tables in system memory and placed in the TLB. This is referred to as a TLB reload.
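The hit, reload and replacement behavior described above can be sketched as a small software model. This is an illustration under stated assumptions rather than the actual TLB design: the TLB class is hypothetical, the number of sets is an arbitrary parameter, the set is chosen from the low bits of the page number, strict LRU ordering is used within each set, and the load_entry callback stands in for the lookup in the page tables in system memory.

```python
from collections import OrderedDict


class TLB:
    """Toy set associative TLB with LRU replacement inside each set."""

    def __init__(self, num_sets=8, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.reloads = 0

    def lookup(self, linear, load_entry):
        page = linear >> 12  # upper 20 bits of the linear address
        entries = self.sets[page % self.num_sets]
        if page in entries:
            self.hits += 1
            entries.move_to_end(page)  # mark as most recently used
        else:
            self.reloads += 1
            if len(entries) >= self.ways:
                entries.popitem(last=False)  # evict the LRU entry
            entries[page] = load_entry(page)
        return entries[page]


def fake_page_tables(page):
    """Hypothetical stand-in for the page tables in system memory."""
    return (page + 0x100) << 12


tlb = TLB(num_sets=1, ways=4)
for page in (0, 1, 2, 3, 0, 4, 1):
    tlb.lookup(page << 12, fake_page_tables)
print(tlb.hits, tlb.reloads)  # prints 1 6
```

In the usage example, the second reference to page 0 hits, page 4 then evicts page 1 as the least recently used entry, and the final reference to page 1 therefore requires another reload.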
c. Paging Mechanism Operation
Referring now to FIG. 7, the paging mechanism operates in the following fashion. When the paging mechanism receives a linear address from the segmentation unit, the upper 20 bits of the linear address are compared with the entries in the TLB to determine if there is a match. If there is a match, referred to as a TLB hit, then the 32-bit physical address is calculated and placed on the address bus. The physical address is calculated using the page frame base address stored in the page table entry and the offset from the linear address as described above.
If the requested page table entry is not in the TLB, then a TLB reload occurs. The CPU first reads the appropriate page directory entry from memory. If the Present bit in the page directory entry indicates that the page table is in memory, then the CPU sets the Accessed bit in the page directory entry using a read/write cycle, calculates the page table entry address, and reads the appropriate page table entry. If the Present bit in the page table entry indicates that the requested page frame is in main memory, then the processor updates the Accessed and/or Dirty bits as needed using a read/write cycle and performs the memory access. The page table entry is stored in the TLB for possible future accesses according to the LRU replacement algorithm described above. If the Present bit for either the page directory entry or the page table entry indicates that the respective page table or page frame is not in memory, then the processor generates a page fault, which potentially means that the requested page table or page frame must be swapped in from disk.
Therefore, reading a new entry into the TLB, referred to as a TLB reload or refresh, is a two-step process and the sequence of data cycles required to perform a TLB refresh is as follows. First, the CPU must read the correct page directory entry from memory. If the Present bit in the entry equals 1, then the CPU must perform a locked read/write cycle to set the Accessed bit in the directory entry. Therefore, the directory entry will actually get read twice and written to once if the CPU needs to set any of the status bits in the entry. The CPU then reads the correct entry in the page table. If the Present bit is 1, then the CPU places the entry in the TLB and then performs a locked read/write to set the Accessed and/or Dirty bits in the page table entry. Here again, the page table entry will actually get read twice and written to once if the CPU needs to set any of the bits in the entry.
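The sequence of cycles described above can be counted with a short sketch. The model below is illustrative only and rests on several assumptions: memory is a dictionary keyed by the physical address of each four-byte entry, a locked read/write cycle is counted as one read plus one write, and a status bit that is already set is assumed not to need a further locked cycle. The names tlb_reload and PageFault are hypothetical.

```python
PRESENT, ACCESSED, DIRTY = 1 << 0, 1 << 5, 1 << 6
PAGE_MASK = 0xFFFFF000


class PageFault(Exception):
    """Raised when a Present bit is clear."""


def tlb_reload(memory, cr3, linear, is_write):
    """Return (page_table_entry, reads, writes) for one TLB reload."""
    reads = writes = 0

    def read(address):
        nonlocal reads
        reads += 1
        return memory[address]

    def locked_set(address, bits):
        # A locked read/write cycle: one read and one write.
        nonlocal reads, writes
        reads += 1
        writes += 1
        memory[address] |= bits
        return memory[address]

    directory_address = (cr3 & PAGE_MASK) + 4 * ((linear >> 22) & 0x3FF)
    directory_entry = read(directory_address)
    if not (directory_entry & PRESENT):
        raise PageFault("page table not present")
    if not (directory_entry & ACCESSED):
        directory_entry = locked_set(directory_address, ACCESSED)

    table_address = (directory_entry & PAGE_MASK) + 4 * ((linear >> 12) & 0x3FF)
    table_entry = read(table_address)
    if not (table_entry & PRESENT):
        raise PageFault("page frame not present")
    needed = ACCESSED | (DIRTY if is_write else 0)
    if (table_entry & needed) != needed:
        table_entry = locked_set(table_address, needed)
    return table_entry, reads, writes


memory = {0x1004: 0x2001, 0x200C: 0x5001}
entry, reads, writes = tlb_reload(memory, 0x1000, 0x00403ABC, is_write=True)
print(reads, writes)  # prints 4 2
```

In this model, a second reload of the same entry finds the status bits already set and needs only the two plain reads.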
Therefore, every time a TLB reload is required, up to six memory accesses, four reads and two writes, are necessary to perform the reload and allow a single memory access to occur. In a situation where a majority of the TLB accesses are misses, thrashing occurs whereby the TLB is continually reloading new page table entries. This can slow memory accesses by a factor of up to six. The performance degradation can actually be much worse because a particular piece of data cannot be accessed until the entire virtual to physical translation has been completed. Therefore, an improved method and apparatus are desired to reduce the number of processor bus cycles required during both segment descriptor and TLB reloads.