1. Field of the Invention
The present invention relates to computer systems using a bus bridge(s) to interface a central processor(s), video graphics processor(s), random access memory and input-output peripherals together, and more particularly, in utilizing a graphics address remapping table (GART table) for remapping non-contiguous physical memory pages into contiguous accelerated graphics port (AGP) device addresses, wherein selected entries of the GART table are cached to speed up the remapping process such that when a GART table entry is retrieved from the computer system random access memory a plurality of GART table entries are retrieved using the same memory access.
2. Description of the Related Technology
Use of computers, especially personal computers, in business and at home is becoming more and more pervasive because the computer has become an integral tool of most information workers who work in the fields of accounting, law, engineering, insurance, services, sales and the like. Rapid technological improvements in the field of computers have opened up many new applications heretofore unavailable or too expensive for the use of older technology mainframe computers. These personal computers may be stand-alone workstations (high end individual personal computers), desk-top personal computers, portable lap-top computers and the like, or they may be linked together in a network by a "network server" which is also a personal computer which may have a few additional features specific to its purpose in the network. The network server may be used to store massive amounts of data, and may facilitate interaction of the individual workstations connected to the network for electronic mail ("E-mail"), document databases, video teleconferencing, white boarding, integrated enterprise calendar, virtual engineering design and the like. Multiple network servers may also be interconnected by local area networks ("LAN") and wide area networks ("WAN").
A significant part of the ever increasing popularity of the personal computer, besides its low cost relative to just a few years ago, is its ability to run sophisticated programs and perform many useful and new tasks. Personal computers today may be easily upgraded with new peripheral devices for added flexibility and enhanced performance. A major advance in the performance of personal computers (both workstation and network servers) has been the implementation of sophisticated peripheral devices such as video graphics adapters, local area network interfaces, SCSI bus adapters, full motion video, redundant error checking and correcting disk arrays, and the like. These sophisticated peripheral devices are capable of data transfer rates approaching the native speed of the computer system microprocessor central processing unit ("CPU"). The peripheral devices' data transfer speeds are achieved by connecting the peripheral devices to the microprocessor(s) and associated system random access memory through high speed expansion local buses. Most notably, a high speed expansion local bus standard has emerged that is microprocessor independent and has been embraced by a significant number of peripheral hardware manufacturers and software programmers. This high speed expansion bus standard is called the "Peripheral Component Interconnect" or "PCI." A more complete definition of the PCI local bus may be found in the PCI Local Bus Specification, revision 2.1; PCI/PCI Bridge Specification, revision 1.0; PCI System Design Guide, revision 1.0; PCI BIOS Specification, revision 2.1, and Engineering Change Notice ("ECN") entitled "Addition of `New Capabilities` Structure," dated May 20, 1996, the disclosures of which are hereby incorporated by reference. These PCI specifications and ECN are available from the PCI Special Interest Group, P.O. Box 14070, Portland, Oreg. 97214.
A computer system has a plurality of information buses (used for transferring instructions, data and address) such as a host bus, a memory bus, at least one high speed expansion local bus such as the PCI bus, and other peripheral buses such as the Small Computer System Interface (SCSI), Extension to Industry Standard Architecture (EISA), and Industry Standard Architecture (ISA). The microprocessor(s) of the computer system communicates with main memory and with the peripherals that make up the computer system over these various buses. The microprocessor(s) communicates to the main memory over a host bus to memory bus bridge. The peripherals, depending on their data transfer speed requirements, are connected to the various buses which are connected to the microprocessor host bus through bus bridges that detect required actions, arbitrate, and translate both data and addresses between the various buses.
Computer systems typically utilize at least one "cache memory" for improved performance. In common usage, the term "cache" refers to a hiding place. The name "cache memory" is an appropriate term for this high speed memory that is interposed between a processor, or bus agent, and main memory because cache memory is hidden from the user or programmer, and thus appears to be transparent. Cache memory, serving as a fast storage buffer between the processor, or bus agent, and main memory, is not user addressable. The user is only aware of the apparently higher-speed memory accesses because the cache memory is satisfying many of the requests instead of the slower main memory.
Cache memory is smaller than main memory because cache memory employs relatively expensive high speed memory devices, such as static random access memory ("SRAM"). Therefore, cache memory typically will not be large enough to hold all of the information needed during program execution. As a process executes, information in the cache memory must be replaced, or "overwritten" with new information from main memory that is necessary for executing the current process(es).
Information is only temporarily stored in cache memory during execution of the process(es). When process data is referenced by a processor, or bus agent, the cache controller will determine if the required data is currently stored in the cache memory. If the required information is found in cache memory, this is referred to as a "cache hit." A cache hit allows the required information to be quickly retrieved from or modified in the high speed cache memory without having to access the much slower main memory, thus resulting in a significant savings in program execution time. When the required information is not found in the cache memory, this is referred to as a "cache miss." A cache miss indicates that the desired information must be retrieved from the relatively slow main memory and then placed into the cache memory. Cache memory updating and replacement schemes attempt to maximize the number of cache hits, and to minimize the number of cache misses.
A cache memory is said to be "direct mapped" if each byte of information can only be written to one place in the cache memory. The cache memory is said to be "fully associative" if a byte of information can be placed anywhere in the cache memory. The cache memory is said to be "set associative" if a group of blocks of information from main memory can only be placed in a restricted set of places in the cache memory, namely, in a specified "set" of the cache memory. Computer systems ordinarily utilize a variation of set associative mapping to keep track of the bytes of information that have been copied from main memory into cache memory.
The hierarchy of a set associative cache memory resembles a matrix. That is, a set associative cache memory is divided into different "sets" (such as the rows of a matrix) and different "ways" (such as the columns of a matrix). Thus, each line of a set associative cache memory is mapped or placed within a given set (row) and within a given way (column). The number of columns, i.e., the number of lines in each set, determines the number of "ways" of the cache memory. Thus, a cache memory with four columns (four lines within each set) is deemed to be "4-way set associative."
Set associative cache memories include addresses for each line in the cache memory. Addresses may be divided into three different fields. First, a "block-offset field" is utilized to select the desired information from a line. Second, an "index field" specifies the set of cache memory where a line is mapped. Third, a "tag field" is used for purposes of comparison. When a request originates from a processor, or bus agent, for new information, the index field selects a set of cache memory. The tag field of every line in the selected set is compared to the tag field sought by the processor. If the tag field of some line matches the tag field sought by the processor, a "cache hit" is detected and information from the block is obtained directly from or modified in the high speed cache memory. If no match occurs, a "cache miss" occurs and the cache memory is typically updated. Cache memory is updated by retrieving the desired information from main memory and then mapping this information into a line of the set associative cache. When the "cache miss" occurs, a line is first mapped with respect to a set (row), and then mapped with respect to a way (column). That is, the index field of a line of information retrieved from main memory specifies the set of cache memory wherein this line will be mapped. A "replacement scheme" is then relied upon to choose the particular line of the set that will be replaced. In other words, a replacement scheme determines the way (column) where the line will be located. The object of a replacement scheme is to select for replacement the line of the set that is least likely to be needed in the near future so as to minimize further cache misses.
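By way of illustration only, the division of an address into tag, index and block-offset fields, and the tag comparison within the selected set, may be sketched as follows. The line size, number of sets, number of ways and the function names are assumptions chosen for the example and are not taken from any particular cache design.

```python
# Illustrative sketch only: geometry and names are hypothetical.
LINE_BYTES = 32   # bytes per cache line (assumed)
NUM_SETS = 64     # number of sets, i.e. rows (assumed)
NUM_WAYS = 4      # lines per set, i.e. columns (4-way set associative)

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1      # 6 bits select the set


def split_address(addr):
    """Return the (tag, index, offset) fields of an address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def lookup(cache, addr):
    """Return the way that hits, or None on a cache miss.

    `cache` is a list of sets; each set is a list with one slot per way,
    holding either None (empty) or a (tag, data) tuple.
    """
    tag, index, _offset = split_address(addr)
    for way, line in enumerate(cache[index]):
        if line is not None and line[0] == tag:
            return way      # cache hit: tag matched in the selected set
    return None             # cache miss: no line in the set carries the tag


if __name__ == "__main__":
    cache = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
    print(split_address(0x12345))
    print(lookup(cache, 0x12345))
```

Only the lines of the one set chosen by the index field are compared, which is why a set associative lookup is cheaper than searching a fully associative cache.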
Several factors contribute to the optimal utilization of cache memory in computer systems: cache memory hit ratio (probability of finding a requested item in cache), cache memory access time, delay incurred due to a cache memory miss, and time required to synchronize main memory with cache memory (write back or write through). In order to minimize delays incurred when a cache miss is encountered, as well as improve cache memory hit rates, an appropriate cache memory replacement scheme is used.
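As a rough illustration of how the hit ratio, the cache access time and the miss delay interact, the average access time may be estimated as the cache access time plus the miss ratio multiplied by the miss delay. This is a common textbook approximation rather than a formula stated herein; the function name and the timing figures are hypothetical, and the time required to synchronize main memory with cache memory is ignored.

```python
# Illustrative sketch only: the timing figures used below are hypothetical.
def average_access_time(hit_ratio, cache_ns, miss_delay_ns):
    """Estimate the mean access time in nanoseconds.

    Every access pays the cache access time; a fraction (1 - hit_ratio)
    of accesses additionally pays the delay of going to main memory.
    """
    return cache_ns + (1.0 - hit_ratio) * miss_delay_ns


if __name__ == "__main__":
    for ratio in (0.50, 0.90, 0.99):
        print(ratio, average_access_time(ratio, 10.0, 100.0))
```

Under these assumed figures, each improvement in hit ratio directly removes a proportional share of the miss delay, which is why replacement schemes aim to raise the hit ratio.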
Set associative cache memory replacement schemes may be divided into two basic categories: non-usage based and usage based. Non-usage based replacement schemes, which include first in, first out ("FIFO") and "random" replacement schemes, make replacement selections on some basis other than memory usage. The FIFO replacement scheme replaces the line of a given set of cache memory which has been contained in the given set for the longest period of time. The random replacement scheme randomly replaces a line of a given set.
Usage based schemes, which include the least recently used ("LRU") replacement scheme, take into account the history of memory usage. In the LRU replacement scheme the least recently used line of information in cache memory is overwritten by the newest entry into cache memory. An LRU replacement scheme assumes that the least recently used line of a given set is the line that is least likely to be reused again in the immediate future. An LRU replacement scheme thus replaces the least recently used line of a given set with a new line of information that must be copied from main memory.
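The difference between the FIFO and LRU replacement schemes may be illustrated, for a single set of a set associative cache, by the following sketch. The class name, the method names and the `load` callback that stands in for a main memory read are assumptions made for the example.

```python
# Illustrative sketch only: models one set; names are hypothetical.
from collections import OrderedDict


class CacheSet:
    """One set of a set associative cache with FIFO or LRU replacement."""

    def __init__(self, ways, policy="lru"):
        self.ways = ways
        self.policy = policy
        self.lines = OrderedDict()   # tag -> data, replacement victim first

    def access(self, tag, load):
        """Return (hit, data); `load(tag)` stands in for a main memory read."""
        if tag in self.lines:
            if self.policy == "lru":
                # A hit refreshes the line's age under LRU only.
                self.lines.move_to_end(tag)
            return True, self.lines[tag]
        if len(self.lines) >= self.ways:
            # Evict the front entry: the oldest arrival under FIFO,
            # the least recently used line under LRU.
            self.lines.popitem(last=False)
        self.lines[tag] = load(tag)
        return False, self.lines[tag]


if __name__ == "__main__":
    for policy in ("fifo", "lru"):
        s = CacheSet(ways=2, policy=policy)
        for tag in "ABAC":
            s.access(tag, str.lower)
        print(policy, list(s.lines))
```

With two ways and the access sequence A, B, A, C, the FIFO scheme evicts A (the first line brought in) while the LRU scheme evicts B (the line not touched most recently).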
When a cache miss occurs, a main memory access must be performed to obtain the desired information which will be stored in the cache. Typically a main memory read access is a cacheline, four quadwords, or 32 bytes in size. Whenever a cacheline of information from the main memory read access is returned, it is returned in toggle mode order, critical quadword first. The transfer order of the four quadwords comprising the cacheline is based on the position of the critical quadword within the cacheline. The toggle mode transfer order is based on an interleaved main memory architecture where the quadwords are woven, or interleaved, between at least two banks of main memory. The four quadwords comprising the cacheline are taken in an order that always accesses opposite main memory banks so that the main memory bank not being accessed may be charged up and ready to accept another access. The toggle mode allows better main memory performance when using dynamic random access memory (DRAM) because memory accesses are not slowed down by pre-charge delays associated with operation of the DRAM.
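A minimal sketch of the toggle mode transfer order follows. It assumes the commonly used interleaved burst ordering, in which the quadword position of each transfer is the critical quadword position exclusive-ORed with the transfer count, and it assumes two memory banks interleaved on quadword boundaries. Neither assumption is spelled out above, and the function names are hypothetical.

```python
# Illustrative sketch only: the XOR ordering and the two-bank
# interleave are assumptions, not details given in the text.
QUADWORD_BYTES = 8
QUADWORDS_PER_LINE = 4   # 4 x 8 bytes = one 32-byte cacheline


def toggle_order(addr):
    """Return the quadword positions of a cacheline, critical quadword first."""
    critical = (addr // QUADWORD_BYTES) % QUADWORDS_PER_LINE
    return [critical ^ count for count in range(QUADWORDS_PER_LINE)]


def bank_of(quadword):
    """Bank holding a quadword when two banks alternate per quadword."""
    return quadword & 1


if __name__ == "__main__":
    for addr in (0x00, 0x08, 0x10, 0x18):
        order = toggle_order(addr)
        print(hex(addr), order, [bank_of(q) for q in order])
```

Under these assumptions every consecutive pair of transfers falls in opposite banks regardless of which quadword is critical, so one bank can pre-charge while the other is being read.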
Increasingly inexpensive but sophisticated microprocessors have revolutionized the role of the personal computer by enabling complex applications software to run at mainframe computer speeds. The latest microprocessors have brought the level of technical sophistication to personal computers that, just a few years ago, was available only in mainframe and mini-computer systems. Some representative examples of these new microprocessors are the "PENTIUM" and "PENTIUM PRO" (registered trademarks of Intel Corporation). Advanced microprocessors are also manufactured by Advanced Micro Devices, Cyrix, IBM, Digital Equipment Corp., Sun Microsystems and Motorola.
These sophisticated microprocessors have, in turn, made possible running complex application programs using advanced three dimensional ("3-D") graphics for computer aided drafting and manufacturing, engineering simulations, games and the like. Increasingly complex 3-D graphics require higher speed access to ever larger amounts of graphics information stored in memory. This memory may be part of the video graphics processor system, but, preferably, would be best (lowest cost) if part of the main computer system memory because shifting graphics information from local graphics memory to main memory significantly reduces computer system costs when implementing 3-D graphics. Intel Corporation has proposed a low cost but improved 3-D graphics standard called the "Accelerated Graphics Port" (AGP) initiative. With AGP, 3-D graphics data, in particular textures, may be shifted out of the graphics controller local memory to computer system main memory. The computer system main memory is lower in cost than the graphics controller local memory and is more easily adapted for a multitude of other uses besides storing graphics data.
The proposed Intel AGP 3-D graphics standard defines a high speed data pipeline, or "AGP bus," between the graphics controller and system main memory. This AGP bus has sufficient bandwidth for the graphics controller to retrieve textures from system memory without materially affecting computer system performance for other non-graphics operations. The Intel 3-D graphics standard is a specification which provides signal, protocol, electrical, and mechanical specifications for the AGP bus and devices attached thereto. The AGP specification is entitled "Accelerated Graphics Port Interface Specification Revision 1.0," dated Jul. 31, 1996, the disclosure of which is hereby incorporated by reference. The AGP Specification is available from Intel Corporation, Santa Clara, Calif.
The AGP Specification uses the 66 MHz PCI (Revision 2.1) Specification as an operational baseline, with three performance enhancements to the PCI Specification which are used to optimize the AGP Specification for high performance 3-D graphics applications. These enhancements are: 1) pipelined memory read and write operations, 2) demultiplexing of address and data on the AGP bus by use of sideband signals, and 3) data transfer rates of 133 MHz for data throughput in excess of 500 megabytes per second ("MB/s"). The remaining AGP Specification does not modify the PCI Specification, but rather provides a range of graphics-oriented performance enhancements for use by 3-D graphics hardware and software designers. The AGP Specification is neither meant to replace nor diminish full use of the PCI Specification in the computer system. The AGP Specification creates an independent and additional high speed local bus for use by 3-D graphics devices such as a graphics controller, wherein the other input-output ("I/O") devices of the computer system may remain on any combination of the PCI, SCSI, EISA and ISA buses.
To functionally enable this AGP 3-D graphics bus, new computer system hardware and software are required. This requires new computer system core logic designed to function as a host bus/memory bus/PCI bus to AGP bus bridge meeting the AGP Specification, and new Read Only Memory Basic Input Output System ("ROM BIOS") and Application Programming Interface ("API") software to make the AGP dependent hardware functional in the computer system. The computer system core logic must still meet the PCI standards referenced above and facilitate interfacing the PCI bus(es) to the remainder of the computer system. In addition, new AGP compatible device cards must be designed to properly interface, mechanically and electrically, with the AGP bus connector.
AGP and PCI device cards are not physically interchangeable even though there is some commonality of signal functions between the AGP and PCI interface specifications. The present AGP Specification only makes allowance for a single AGP device on an AGP bus, whereas the PCI Specification allows two plug-in slots for PCI devices plus a bridge on a PCI bus running at 66 MHz. The single AGP device is capable of functioning in both a 1x mode (264 MB/s peak) and a 2x mode (532 MB/s peak). The AGP bus is defined as a 32 bit bus, and may have up to four bytes of data transferred per clock in the 1x mode and up to eight bytes of data per clock in the 2x mode. The PCI bus is defined as either a 32 bit or 64 bit bus, and may have up to four or eight bytes of data transferred per clock, respectively. The AGP bus, however, has additional sideband signals which enable it to transfer blocks of data more efficiently than is possible using a PCI bus. An AGP bus running in the 2x mode provides sufficient video data throughput (532 MB/s peak) to allow increasingly complex 3-D graphics applications to run on personal computers.
A major performance/cost enhancement using AGP in a computer system is accomplished by shifting texture data structures from local graphics memory to main memory. Textures are ideally suited for this shift for several reasons. Textures are generally read-only, and therefore problems of access ordering and coherency are less likely to occur. Shifting of textures serves to balance the bandwidth load between system memory and local graphics memory, since a well-cached host processor has much lower memory bandwidth requirements than does a 3-D rendering machine; texture access comprises perhaps the single largest component of rendering memory bandwidth, so avoiding loading or caching textures in local graphics memory saves not only this component of local memory bandwidth, but also the bandwidth necessary to load the texture store in the first place, and, further, this data must pass through main memory anyway as it is loaded from a mass store device. Texture size is dependent upon application quality rather than on display resolution, and therefore may require the greatest increase in memory as software applications become more advanced. Texture data is not persistent and may reside in the computer system memory only for the duration of the software application, so any system memory spent on texture storage can be returned to the free memory heap when the application concludes (unlike a graphic controller's local frame buffer which may remain in persistent use). For these reasons, shifting texture data from local graphics memory to main memory significantly reduces computer system costs when implementing 3-D graphics.
Generally, in a computer system memory architecture the graphics controller's physical address space resides above the top of system memory. The graphics controller uses this physical address space to access its local memory which holds information required to generate a graphics screen. In the AGP system, information still resides in the graphics controller's local memory (textures, alpha, z-buffer, etc.), but some data which previously resided in this local memory is moved to system memory (primarily textures, but also command lists, etc.). The address space employed by the graphics controller to access these textures becomes virtual, meaning that the physical memory corresponding to this address space doesn't actually exist above the top of memory. In reality, each of these virtual addresses corresponds to a physical address in system memory. The graphics controller sees this virtual address space, referenced hereinafter as "AGP device address space," as one contiguous block of memory, but the corresponding physical memory addresses may be allocated in 4 kilobyte ("KB"), non-contiguous pages throughout the computer system physical memory.
There are two primary AGP usage models for 3D rendering that have to do with how data are partitioned and accessed, and the resultant interface data flow characteristics. In the "DMA" model, the primary graphics memory is a local memory referred to as `local frame buffer` and is associated with the AGP graphics controller or "video accelerator." 3D structures are stored in system memory, but are not used (or "executed") directly from this memory; rather they are copied to primary (local) memory, to which the rendering engine's address generator (of the AGP graphics controller) makes references. This implies that the traffic on the AGP bus tends to be long, sequential transfers, serving the purpose of bulk data transport from system memory to primary graphics (local) memory. This sort of access model is amenable to a linked list of physical addresses provided by software (similar to operation of a disk or network I/O device), and is generally not sensitive to a non-contiguous view of the memory space.
In the "execute" model, the video accelerator uses both the local memory and the system memory as primary graphics memory. From the accelerator's perspective, the two memory systems are logically equivalent; any data structure may be allocated in either memory, with performance optimization as the only criterion for selection. In general, structures in system memory space are not copied into the local memory prior to use by the video accelerator, but are "executed" in place. This implies that the traffic on the AGP bus tends to be short, random accesses, which are not amenable to an access model based on software resolved lists of physical addresses. Since the accelerator generates direct references into system memory, a contiguous view of that space is essential. But, since system memory is dynamically allocated in, for example, random 4,096 byte blocks of the memory, hereinafter 4 kilobyte ("KB") pages, it is necessary in the "execute" model to provide an address mapping mechanism that maps the random 4 KB pages into a single contiguous address space.
The AGP Specification, incorporated by reference hereinabove, supports both the "DMA" and "execute" models. However, since a primary motivation of the AGP is to reduce growth pressure on the graphics controller's local memory (including local frame buffer memory), the "execute" model is preferred. Consistent with this preference, the AGP Specification requires a virtual-to-physical address re-mapping mechanism which ensures the graphics accelerator (AGP master) will have a contiguous view of graphics data structures dynamically allocated in the system memory. This address re-mapping applies only to a single, programmable range of the system physical address space and is common to all system agents. Addresses falling in this range are re-mapped to non-contiguous pages of physical system memory. All addresses not in this range are passed through without modification, and map directly to main system memory, or to device specific ranges, such as a PCI device's physical memory. Re-mapping is accomplished via a "Graphics Address Remapping Table" ("GART table") which is set up and maintained by a GART miniport driver software, and used by the core logic chipset to perform the re-mapping. In order to avoid compatibility issues and allow future implementation flexibility, this mechanism is specified at a software (API) level. In other words, the actual GART table format may be abstracted to the API by a hardware abstraction layer ("HAL") or mini-port driver that is provided with the core logic chipset. While this API does not constrain the future partitioning of re-mapping hardware, the remapping function will typically be implemented in the core logic chipset.
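The re-mapping behavior described above may be sketched as follows. Because the actual GART table format is left to the chipset and its driver, the sketch assumes the simplest possible format, in which each table entry holds nothing but the physical base address of one 4 KB page. The function name, the parameter names and all addresses are hypothetical.

```python
# Illustrative sketch only: the entry format (a bare physical page base
# address per entry) and every address used here are assumptions.
PAGE_SIZE = 4096
PAGE_SHIFT = 12


def remap(addr, range_base, gart_table):
    """Translate an address through the GART table if it lies in the
    re-mapped range; otherwise pass it through unmodified."""
    range_size = len(gart_table) * PAGE_SIZE
    if not (range_base <= addr < range_base + range_size):
        return addr                               # outside the range: unchanged
    offset = addr - range_base
    page_base = gart_table[offset >> PAGE_SHIFT]  # one entry per 4 KB page
    return page_base + (offset & (PAGE_SIZE - 1))


if __name__ == "__main__":
    # Three contiguous device pages backed by scattered physical pages.
    table = [0x0034A000, 0x00012000, 0x007FF000]
    base = 0x10000000
    for a in (0x10000000, 0x10001010, 0x10002FFF, 0x00005000):
        print(hex(a), "->", hex(remap(a, base, table)))
```

Consecutive device addresses that cross a 4 KB boundary land in unrelated physical pages, while the requester continues to see one contiguous block.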
The contiguous AGP graphics controller's device addresses are mapped (translated) into corresponding physical addresses that reside in the computer system physical memory by using the GART table which may also reside in physical memory. The GART table is used by the core logic chipset to remap AGP device addresses that can originate from either the AGP, host, or PCI buses. The GART table is managed by a software program called a "GART miniport driver." The GART miniport driver provides GART services for the computer software operating system.
Residing in the system memory, the GART table may be read from and/or written to by the core logic driver software, i.e. the aforementioned GART miniport driver, or any other software program or application programming interface ("API") program using the host microprocessor(s), AGP graphics devices, or a PCI device. The GART table is used by the computer system core logic to remap the virtual addresses of the graphics data requested by the AGP graphics controller to physical addresses of pages that reside in the computer system memory (translate addresses). Thus, the AGP graphics controller can work in contiguous virtual address space, but use non-contiguous pages of physical system memory to store graphics data such as textures and the like.
Typically, the core logic will cache a subset of the most recently accessed GART table entries to increase system performance when mapping from the AGP device address space (AGP virtual address space) to the physical address space of the computer system main memory. A GART table entry is typically a doubleword which is four bytes in size. An access to main memory is typically a cacheline which is four quadwords or 32 bytes in size; the cacheline is returned in toggle mode order, desired quadword first, as described above. If only one GART table entry (a doubleword) is stored in the core logic cache for each memory access of the GART table, half of the quadword and most of the cacheline memory access (three quadwords) will not be utilized, and for each subsequent cacheline miss another memory access must be performed.
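The arithmetic behind this observation may be illustrated as follows, using the sizes given above (a four byte entry, an eight byte quadword and a 32 byte cacheline). The sketch assumes that the GART table begins on a cacheline boundary; the function names are hypothetical.

```python
# Illustrative sketch only: assumes the table starts cacheline aligned.
ENTRY_BYTES = 4        # one GART table entry (a doubleword)
QUADWORD_BYTES = 8
CACHELINE_BYTES = 32   # one main memory access (four quadwords)

ENTRIES_PER_QUADWORD = QUADWORD_BYTES // ENTRY_BYTES    # 2
ENTRIES_PER_LINE = CACHELINE_BYTES // ENTRY_BYTES       # 8


def entries_in_same_cacheline(entry_index):
    """Indices of all entries returned by the memory access that
    fetches the entry at `entry_index`."""
    first = (entry_index // ENTRIES_PER_LINE) * ENTRIES_PER_LINE
    return list(range(first, first + ENTRIES_PER_LINE))


def fraction_used(entries_kept):
    """Fraction of the fetched cacheline that ends up in the cache."""
    return entries_kept * ENTRY_BYTES / CACHELINE_BYTES


if __name__ == "__main__":
    print(entries_in_same_cacheline(10))
    print(fraction_used(1), fraction_used(ENTRIES_PER_LINE))
```

Keeping a single entry uses one eighth of the data already transferred, whereas the same memory access carries the entries for seven neighboring 4 KB pages of AGP device address space.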
What is needed is a system, method and apparatus for improving the probability of GART cache hits and to better utilize the cacheline data returned from a memory access of the GART table stored in the computer system main memory.