1. Field of the Invention
This invention is related to the field of computer systems and microprocessors, and, more particularly, to memory latency solutions within computer systems.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. On the other hand, superpipelined microprocessor designs divide instruction execution into a large number of subtasks which can be performed quickly, and assign pipeline stages to each subtask. By overlapping the execution of many instructions within the pipeline, superpipelined microprocessors attempt to achieve high performance.
Superscalar microprocessors demand low memory latency due to the number of instructions attempting concurrent execution and due to the increasing clock frequency (i.e. shortening clock cycle) employed by the superscalar microprocessors. Many of the instructions include memory operations to fetch (read) and update (write) memory operands. The memory operands must be fetched from or conveyed to memory, and each instruction must originally be fetched from memory as well. Similarly, superpipelined microprocessors demand low memory latency because of the high clock frequency employed by these microprocessors and the attempt to begin execution of a new instruction each clock cycle. It is noted that a given microprocessor design may employ both superscalar and superpipelined techniques in an attempt to achieve the highest possible performance characteristics.
Microprocessors are often configured into computer systems which have a relatively large, relatively slow main memory. Typically, multiple dynamic random access memory (DRAM) modules comprise the main memory system. The large main memory provides storage for a large number of instructions and/or a large amount of data for use by the microprocessor, providing faster access to the instructions and/or data than may be achieved from disk storage, for example. However, the access times of modern DRAMs are significantly longer than the clock cycle length of modern microprocessors. The memory access time for each set of bytes being transferred to the microprocessor is therefore long. Accordingly, the main memory system is not a low latency system. Microprocessor performance may suffer due to high memory latency.
In order to allow low latency memory access (thereby increasing the instruction execution efficiency and ultimately microprocessor performance), computer systems typically employ one or more caches to store the most recently accessed data and instructions. Additionally, the microprocessor may employ caches internally. A relatively small number of clock cycles may be required to access data stored in a cache, as opposed to a relatively larger number of clock cycles required to access the main memory.
Low memory latency may be achieved in a computer system if the cache hit rates of the caches employed therein are high. An access is a hit in a cache if the requested data is present within the cache when the access is attempted. On the other hand, an access is a miss in a cache if the requested data is absent from the cache when the access is attempted. Cache hits are provided to the microprocessor in a small number of clock cycles, allowing subsequent accesses to occur more quickly as well and thereby decreasing the effective memory latency. Cache misses require the access to receive data from the main memory, thereby increasing the effective memory latency.
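The relationship between cache hit rate and effective memory latency described above may be illustrated by the following minimal sketch. It is not part of any described system: the function names, the least-recently-used replacement policy, the 32-byte line size, and the cycle counts are all assumptions chosen for illustration only.

```python
from collections import OrderedDict


def simulate_cache(addresses, capacity, line_size=32):
    """Count hits and misses for a small cache holding `capacity` lines.

    Assumption: least-recently-used replacement and a 32-byte line,
    neither of which is specified by the text above.
    """
    cache = OrderedDict()  # keys are line numbers, ordered oldest to newest
    hits = misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently accessed
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return hits, misses


def effective_latency(hits, misses, hit_cycles, miss_cycles):
    """Average cycles per access, weighting hits and misses by their counts."""
    return (hits * hit_cycles + misses * miss_cycles) / (hits + misses)


if __name__ == "__main__":
    hits, misses = simulate_cache([0, 4, 8, 64, 0, 128, 64], capacity=2)
    # Hypothetical costs: 2 cycles for a cache hit, 20 cycles for main memory.
    print(hits, misses, effective_latency(hits, misses, 2, 20))
```

Under these assumed costs, each additional hit replaces a 20-cycle main memory access with a 2-cycle cache access, so the average moves toward the cache access time as the hit rate rises.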
Many types of DRAMs employ a "page mode" which allows for memory latency to be decreased for transfers within the same "page". Generally, DRAMs comprise memory arranged into rows and columns of storage. A first portion of the address identifying the desired data/instructions is used to select one of the rows (the "row address"), and a second portion of the address is used to select one of the columns (the "column address"). One or more bits residing at the selected row and column are provided as output of the DRAM. Multiple DRAMs may be accessed concurrently (a "bank") to provide more output bits per access. Typically, the row address is provided to the DRAM first, and the selected row is placed into a temporary buffer within the DRAM. Subsequently, the column address is provided and the selected data is output from the DRAM. If the next address to access the DRAM is within the same row (i.e. the row address is the same as the current row address), then that next access may be performed by providing the column portion of the address only, omitting the row address transmission. The next access may therefore be performed with lower latency, saving the time required for transmitting the row address. Addresses having the same row address are said to be in the same page. The size of the page may therefore be dependent upon the number of columns within the row and the number of DRAMs included within the bank. The row, or page, stored in the temporary buffer within the DRAM is referred to as the "open page", since accesses within the open page can be performed in page mode (i.e. by transmitting the column portion of the address only).
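The division of an address into a row address and a column address may be sketched as follows. This is an illustrative example only: the 10-bit column field, the placement of the row address in the upper bits, and the function names are assumptions, and an actual memory system may map address bits differently.

```python
COLUMN_BITS = 10  # assumed width of the column address; not taken from the text


def split_address(addr, column_bits=COLUMN_BITS):
    """Return (row_address, column_address) for a flat address.

    Assumption: the row address occupies the bits above the column field.
    """
    row = addr >> column_bits
    column = addr & ((1 << column_bits) - 1)
    return row, column


def same_page(addr_a, addr_b, column_bits=COLUMN_BITS):
    """Two addresses are in the same page when their row addresses match."""
    return split_address(addr_a, column_bits)[0] == split_address(addr_b, column_bits)[0]


if __name__ == "__main__":
    print(split_address(0x1234))      # row and column of one address
    print(same_page(0x1000, 0x13FF))  # both fall in the same row
    print(same_page(0x13FF, 0x1400))  # adjacent addresses in different rows
```

With a 10-bit column field, each page in this sketch spans 1024 consecutive addresses, which reflects the statement above that the page size depends upon the number of columns within the row.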
Unfortunately, the first access to a given page generally must occur in non-page mode, thereby incurring a higher memory latency. Even further, the access may experience a page mode miss (where the DRAMs still have a particular page open, and the particular page must first be closed before opening the page containing the current access). Often, this first access is critical to maintaining performance in the microprocessors within the computer system, as the data/instructions are immediately needed to satisfy a cache miss. Instruction execution may stall while the memory responds in non-page mode. A method for increasing the number of memory accesses performed in page mode is therefore desired.
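The three cases discussed above (an access in page mode, a first access with no page open, and a page mode miss requiring the open page to be closed) may be modeled by the following minimal sketch. The class name, the address mapping, and all cycle counts are hypothetical values chosen for illustration; they do not describe any particular DRAM.

```python
class DramBank:
    """Toy model of one DRAM bank that tracks a single open page."""

    def __init__(self, column_bits=10, column_cycles=2, row_cycles=3, close_cycles=3):
        # All timing values are assumed, illustrative cycle counts.
        self.column_bits = column_bits
        self.column_cycles = column_cycles  # transmit column address, get data
        self.row_cycles = row_cycles        # transmit row address, open the page
        self.close_cycles = close_cycles    # close the currently open page
        self.open_row = None                # no page is open initially

    def access(self, addr):
        """Return (kind, cycles) for an access and update the open page."""
        row = addr >> self.column_bits
        if self.open_row == row:
            # Page mode: only the column address is transmitted.
            kind, cycles = "page hit", self.column_cycles
        elif self.open_row is None:
            # Non-page mode: the row must be opened first.
            kind, cycles = "non-page mode", self.row_cycles + self.column_cycles
        else:
            # Page mode miss: close the open page, then open the new one.
            kind = "page miss"
            cycles = self.close_cycles + self.row_cycles + self.column_cycles
        self.open_row = row
        return kind, cycles


if __name__ == "__main__":
    bank = DramBank()
    for addr in (0x1000, 0x1004, 0x2000):
        print(hex(addr), bank.access(addr))
```

In this sketch only the second access completes in page mode; increasing the fraction of accesses that take the first branch is the goal stated above.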