Microprocessors, including those of the x86 and Pentium families of processors available from Intel Corporation, execute instructions and manipulate data stored in a main memory, typically some amount of dynamic random-access memory, or DRAM. Modern processors execute instructions far faster than reasonably priced DRAM can supply instructions and data. DRAM access times thus adversely affect processor performance.
Cache memory offers the most common solution to the DRAM bottleneck. Modern processors still use relatively slow and inexpensive DRAM for main memory, but also include a smaller amount of fast, expensive static RAM (SRAM) cache memory. The SRAM cache maintains copies of frequently accessed information read from DRAM. The processor then looks for instructions and data in the cache memory before resorting to the slower main memory.
Modern computer systems must typically reference a large number of stored programs and associated program information. The size of this information necessitates an economical mass storage system, typically composed of magnetic disk storage. The access time of this mass storage is very long compared to access times of semiconductor memories such as SRAM or DRAM, motivating the use of a memory hierarchy. The concept of virtual memory was created to simplify addressability of information within the memory hierarchy and sharing of information between programs. The following is a formal definition of the term “virtual memory,” provided in a classic text on the subject:
“Virtual memory is a hierarchical storage system of at least two levels, managed by an operating system (OS) to appear to the user as a single, large, directly-addressable main memory.”
Computer Organization, 3rd ed., V. C. Hamacher, Z. G. Vranesic, and S. G. Zaky, McGraw-Hill, New York, 1990. Further elaboration is provided in another commonly referenced text:
“The main memory can act as a ‘cache’ for the secondary storage, usually implemented with magnetic disks. This technique is called virtual memory. There are two major motivations for virtual memory: to allow efficient and safe sharing of memory among multiple programs and to remove the programming burden of a small, limited amount of main memory.”
Computer Organization and Design: The Hardware/Software Interface, 2nd edition, David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1998.
The upper levels of modern memory hierarchies typically include cache and main memory. “Cache is the name first chosen to represent the level of memory hierarchy between the CPU and main memory” (Computer Architecture: A Quantitative Approach, by Hennessy and Patterson, 1990, p. 408). “Main memory satisfies the demands of caches and vector units, and serves as the I/O interface as it is the destination of input as well as the source for output” (Id. at p. 425). Most main memories are composed of dynamic random-access memories, or DRAMs, while most caches are relatively faster static random-access memories, or SRAMs (Id. at p. 426). Most modern systems subdivide memory into pages (commonly 4 KB in size), and the OS swaps pages between main memory and the disk storage system based on an appropriate page allocation and replacement scheme.
The virtual memory is addressed by virtual addresses, which must be translated to physical addresses before cache or main-memory accesses can occur. The translation is typically performed by an address translation unit in the processor, which accesses address translation information stored in the main memory. In an x86 architecture processor, the address translation information is stored hierarchically in the form of a Page Directory consisting of multiple Page Directory Entries (or PDEs). Each PDE, in turn, references a Page Table consisting of multiple Page Table Entries (or PTEs). Each PTE, in turn, contains the physical address and attribute bits of the referenced page or Page Frame. For the purposes of this specification, the translation information will be referred to herein generically as “address translation information” (ATI), and the structures used to store this information will be referred to herein as “address translation tables.” The terms “page tables,” “page directories,” or “page tables and page directories” may be used interchangeably with “address translation tables.”
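The two-level table walk described above can be illustrated with a short sketch. The sketch below is hypothetical: the table contents, the function name, and the field widths (a 10-bit PDE index, a 10-bit PTE index, and a 12-bit offset for 4 KB pages) are illustrative assumptions in the style of the x86 scheme, not details taken from the figures.

```python
# Hypothetical x86-style two-level page walk with 4 KB pages.
# Bits 31..22 of the virtual address index a Page Directory Entry (PDE),
# bits 21..12 index a Page Table Entry (PTE), and bits 11..0 are the
# offset within the page frame. Table contents are illustrative only.

PAGE_DIRECTORY = {0x001: 0x1000}          # PDE index -> page table base
PAGE_TABLES = {0x1000: {0x002: 0x9A000}}  # table base -> {PTE index -> page frame}

def translate(va):
    """Walk the address translation tables to turn a virtual address
    into a physical address."""
    pde_index = (va >> 22) & 0x3FF   # bits 31..22 select the PDE
    pte_index = (va >> 12) & 0x3FF   # bits 21..12 select the PTE
    offset = va & 0xFFF              # bits 11..0 are the page offset
    page_table = PAGE_DIRECTORY[pde_index]
    frame = PAGE_TABLES[page_table][pte_index]
    return frame | offset
```

For example, `translate(0x00402ABC)` selects PDE 0x001 and PTE 0x002, yielding page frame 0x9A000 plus offset 0xABC.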
Address translation tables are stored in main memory. Address translations that must reference this information thus suffer the same speed penalty as other references to main memory: namely, the CPU must wait many clock cycles while the system produces the physical address associated with a corresponding virtual address. Once again, cache memory offers the most common solution to the DRAM bottleneck. In this case, however, the cache is an address translation cache that stores the most commonly referenced set of virtual page addresses and the physical page address associated with each stored virtual page address. Using this scheme, the vast majority of address translations can be accomplished without the speed penalty associated with a request from main memory: the required physical address is provided directly from the address translation cache after a small lookup time. Address translation caches are commonly referred to as translation look-aside buffers (TLBs), page translation caches (PTCs), or “translation buffers” (TBs). The term TLB will be used throughout the remainder of this specification to represent the aforementioned type of address translation cache. Many CPUs include more than one TLB for a variety of reasons related to performance and implementation complexity.
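A TLB of the kind just described can be modeled as a small map from virtual page numbers to physical page numbers. The class below is a minimal, hypothetical sketch: the capacity, the arbitrary replacement policy, and all names are illustrative assumptions rather than features of any particular processor.

```python
# Minimal sketch of a TLB: a bounded map from virtual page number (VPN)
# to physical page number (PPN). Sizes and policy are illustrative.

PAGE_SHIFT = 12  # 4 KB pages

class TLB:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}  # VPN -> PPN

    def lookup(self, va):
        """Return the translated physical address on a hit, None on a miss."""
        ppn = self.entries.get(va >> PAGE_SHIFT)
        if ppn is None:
            return None                          # miss: walk the tables
        return (ppn << PAGE_SHIFT) | (va & 0xFFF)

    def fill(self, va, pa):
        """Install a translation, evicting an arbitrary entry when full."""
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))
        self.entries[va >> PAGE_SHIFT] = pa >> PAGE_SHIFT
```

After a fill, any address within the same 4 KB page hits and is translated with only the map lookup, which stands in for the small lookup time mentioned above.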
Conventional microprocessor/main memory combinations are well understood by those of skill in the art. The operation of one such combination is nevertheless described below to provide context for a discussion of the invention.
FIG. 1 depicts a portion of a conventional computer system 100, including a central processing unit (CPU) 102 connected to a memory controller device 104 via a system bus 106. The memory controller device 104 acts as a bridge between CPU 102 and main memory 108. Other terms are often used in the computer industry to describe this type of bridge device, including “north bridge,” “memory controller hub,” or simply “memory controller.” This device is often sold as part of a set of devices, commonly referred to as the system “chip set.” Throughout this specification, the term “memory controller device” will be used to refer to the device that serves as the main memory bridge, while the term “memory controller” will refer more narrowly to the block of logic that controls main memory access.
Memory controller device 104 is connected to a main memory 108 via a communication port 110 and to an IO controller 132 via an IO controller interface 150. Other interfaces may be optionally provided as part of memory controller device 104, but those interfaces are beyond the scope of this specification. System bus 106 conventionally includes address lines 140, data lines 142, and control lines 144. Communication port 110 likewise includes main-memory address lines, data lines, and control lines. Most interfaces also include a synchronization mechanism consisting of one or more clocks or strobes, although, for simplicity, these clocks are not shown in the figures herein.
IO controller 132 interfaces to peripherals 112 via one or more peripheral interfaces 114. Peripherals might comprise one or more of the following: a keyboard or keyboard controller, hard disk drive(s), floppy disk drive(s), mouse, joystick, serial I/O, audio system, modem, or Local Area Network (LAN). Peripherals are mentioned here for clarification purposes although the specific set of peripherals supported and means of interfacing to them are omitted for brevity.
CPU 102 includes a CPU core 116, which includes an address generation unit 118, an address translation unit 122, a bus unit 124, and a cache memory 126. Address generation unit 118 represents the unit or units that generate virtual addresses, which may include address calculation unit(s), load/store unit(s), instruction prefetch unit(s), data prefetch unit(s), or other sources. Cache memory 126 represents a simplified view of the on-chip cache hierarchy. Modern processors typically use a 2-level on-chip cache hierarchy consisting of a Level 1 cache and Level 2 cache, although fewer or more levels may be used. Level 1 caches are commonly split between instruction and data caches, while Level 2 caches are more commonly unified to contain a combination of instructions and data. Each physical cache is further subdivided into one or more cache tag sections, cache control sections, and data storage sections. FIG. 1 abstracts this level of detail to simply show the cache tags and control separate from the data section. As with most modern processors, CPU 102 includes a TLB 120 for storing the most commonly referenced virtual page addresses and their corresponding physical page addresses to greatly reduce the need to refer to address translation tables stored in main memory. Some CPUs include multiple TLBs, in some cases, separating instruction TLBs from data TLBs.
TLB 120 conveys physical addresses PA directly to cache memory 126 and to memory controller device 104 via bus unit 124 and system bus 106. Memory controller device 104 includes a memory controller 134 that translates physical addresses PA to main-memory addresses MA suitable for accessing data and instructions in a portion 135 of main memory 108. Main memory 108 is typically DRAM. As mentioned previously, main memory 108 includes address translation tables 136 that store the requisite information for translating virtual page addresses into physical page addresses (VA→PA).
CPU core 116 executes instructions and manipulates data obtained in a portion 135 of main memory 108 using a series of memory references. Fetching instructions and reading from or writing to main memory 108 requires a bus transaction, during which bus unit 124 communicates with memory controller device 104 to read from or write to main memory 108.
Address generation unit 118 of CPU core 116 presents a virtual page address VA to TLB 120 and directly or indirectly to address translation unit 122. If the specified virtual page address and an associated physical page address PA are stored in TLB 120, then TLB 120 presents the corresponding physical page address PA together with a page offset to cache tags and control circuitry 128. There is a relatively high probability that the requested data or instruction resides in the on-chip cache 126. For cache hits, the requested memory reference is returned to the CPU core 116 via an instruction/data path 133. Cache hit/miss status is returned to bus unit 124 via the cache-miss signal C_M to indicate whether a bus transaction is necessary across system bus 106.
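The cache tag check performed by cache tags and control circuitry 128 can be sketched for the simple direct-mapped case. The line size, set count, and all names below are assumptions for illustration; an on-chip cache such as cache 126 is typically set-associative, which this sketch does not model.

```python
# Illustrative direct-mapped cache tag check: the physical address is
# split into tag, set index, and block offset; the tag array decides
# hit or miss. Sizes (64-byte lines, 512 sets = 32 KB) are assumptions.

LINE_BITS = 6     # 64-byte cache lines
INDEX_BITS = 9    # 512 sets

tags = [None] * (1 << INDEX_BITS)   # simplified cache tag array

def cache_lookup(pa):
    """Return True on a cache hit; on a miss, allocate the line and
    return False (standing in for the C_M signal to the bus unit)."""
    index = (pa >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = pa >> (LINE_BITS + INDEX_BITS)
    if tags[index] == tag:
        return True          # hit: data comes from the on-chip cache
    tags[index] = tag        # miss: line is fetched via a bus transaction
    return False
```

A first access to a line misses and triggers a fill; a second access to the same line hits without a bus transaction.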
If CPU core 116 presents a virtual page address VA for which there is no corresponding address translation in TLB 120, then TLB 120 issues a TLB miss signal on line TLB_M. Address translation unit 122 responds by requesting a bus transaction from bus unit 124 to retrieve address translation information (ATI) from an address translation table section 136 of main memory 108. Address translation unit 122 provides an address translation address ATA to memory controller device 104 via bus unit 124 and system bus 106. The address translation address ATA typically identifies the location or a pointer to the location of the address translation table entry containing the physical page address of the requested memory reference. The number of levels of indirection or levels of address translation table hierarchy is implementation-dependent.
Memory controller 134 converts the address translation address ATA to a main-memory address MA, typically consisting of device/rank, bank, row, and column address fields, to access the requested address translation information ATI. The translation of a physical address PA to a main-memory address MA is generally dependent upon the installed memory configuration. Most computer systems support a variety of possible memory configurations and provide a mechanism to communicate the installed configuration information to the system for proper addressability.
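One possible decomposition of a physical address into the device/rank, bank, row, and column fields mentioned above is sketched below. The field widths are hypothetical, describing a single illustrative memory configuration; as the text notes, the real mapping depends on the installed memory and varies across configurations.

```python
# Hypothetical physical-to-main-memory address decomposition for one
# assumed DRAM configuration: 10 column bits, 13 row bits, 2 bank bits,
# remaining high bits selecting the device/rank. Widths are illustrative.

COL_BITS, ROW_BITS, BANK_BITS = 10, 13, 2

def pa_to_ma(pa):
    """Carve a physical address into DRAM address fields."""
    col = pa & ((1 << COL_BITS) - 1)
    row = (pa >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (pa >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    rank = pa >> (COL_BITS + ROW_BITS + BANK_BITS)
    return {"rank": rank, "bank": bank, "row": row, "col": col}
```

Supporting many such field arrangements is what makes the mapping flexible, and performing this bit selection on every access is the latency the remainder of the specification seeks to hide.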
A memory request is issued from memory controller 134 containing main-memory address MA. Main memory 108 then returns the appropriate address translation information ATI stored at address MA to address translation unit 122 via memory controller device 104, system bus 106, and bus unit 124. Address translation unit 122 computes the appropriate physical page address PA using the ATI and stores the result in TLB 120, potentially replacing a previous TLB entry. The translation is then available from TLB 120 for presentation to cache 126. The translated physical address could also be delivered directly from the address translation unit 122 to cache memory 126 to reduce access latency.
If the requested data or instruction is in cache 126, then, as before when TLB 120 contained the appropriate virtual-to-physical page address translation, cache 126 presents the requested data or instruction to CPU core 116. This type of memory access requires at least one bus transaction to obtain the address translation information ATI, and is therefore significantly slower than the scenario in which the translation existed in TLB 120.
Slower still, the requested virtual page address VA may be absent from TLB 120 and the requested data or instruction information may be absent from cache 126. Such a condition requires at least two bus transactions or series of transactions. The first bus transaction or series of transactions obtains the address translation information corresponding to the requested virtual page address VA. The second bus transaction or series of transactions retrieves the requested data or instruction in main memory 108 that will be returned to CPU core 116 and cache 126. Once again, memory controller 134 must translate the physical page address PA into a main-memory address MA before the requested information can be accessed in main memory 108. The latency associated with this translation typically degrades system performance. Furthermore, the number of possible physical-to-main-memory address mappings supported by modern memory controllers must be kept reasonably small to avoid further increases to address translation latency. A mechanism or technique that could reduce the average latency associated with physical-to-main-memory address translation while potentially increasing the flexibility of address mapping support would therefore be very desirable.
The present invention is directed to a processor that speeds references to main memory by storing main-memory addresses in a TLB. As with conventional TLBs, a TLB in accordance with the invention includes a number of entries, each including a virtual-address field and a physical-address field. Unlike conventional TLBs, however, each entry of the TLB of the present invention additionally includes a main-memory address field.
In the event of a TLB hit coupled with a cache miss, the processor passes the main-memory address in the TLB directly to the memory controller, avoiding the latency normally associated with systems that must translate a physical page address to a main-memory address before accessing information from main memory. The use of this type of mechanism generally allows greater flexibility in performing the physical-to-main memory address translation because translation latency is no longer in a latency-critical path.
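The mechanism summarized above can be sketched as a TLB whose entries carry a main-memory address alongside the virtual and physical page addresses, with the physical-to-main-memory translation performed once at fill time, off the latency-critical path. All names below, including the simplified `pa_to_ma` helper standing in for the memory controller's mapping, are illustrative assumptions rather than the claimed implementation.

```python
# Sketch of a TLB extended with a main-memory address field: on a hit,
# both PA (for the cache) and MA (for the memory controller) are
# available, so a cache miss need not wait for PA -> MA translation.

PAGE_SHIFT = 12

def pa_to_ma(pa):
    """Stand-in for the memory controller's physical-to-main-memory
    mapping, evaluated once when the TLB entry is installed."""
    return {"row": pa >> 10, "col": pa & 0x3FF}

class ExtendedTLB:
    def __init__(self):
        self.entries = {}  # VPN -> (PPN, main-memory address of the page)

    def fill(self, va, pa):
        # Translate PA -> MA off the critical path, at fill time.
        self.entries[va >> PAGE_SHIFT] = (pa >> PAGE_SHIFT,
                                          pa_to_ma(pa & ~0xFFF))

    def lookup(self, va):
        """On a hit, return (PA, MA); on a miss, return None."""
        hit = self.entries.get(va >> PAGE_SHIFT)
        if hit is None:
            return None
        ppn, ma = hit
        return (ppn << PAGE_SHIFT) | (va & 0xFFF), ma
```

On a TLB hit coupled with a cache miss, the stored MA can be passed directly to the memory controller, which is the latency saving the summary describes.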
The claims, and not this summary, define the scope of the invention.
FIG. 1 depicts a portion of a conventional computer system 100.
FIG. 2 depicts a computer system 200 configured in accordance with the invention to speed access to data and instructions in main memory in the event of a cache miss.
FIG. 3 graphically depicts one embodiment of an address translation scheme used in accordance with system 200 of FIG. 2.
FIG. 4 shows an example of a flexible physical-to-main-memory address translation applied according to region, where each region represents a range of physical addresses.