Not applicable.
1. Field of the Invention
The present invention generally relates to a computer system that includes one or more random access memory (“RAM”) devices for storing data. More particularly, the invention relates to a computer system with RAM devices in which multiple banks of storage can be accessed simultaneously to enhance the performance of the memory devices. Still more particularly, the present invention relates to a system for the mapping of processor addresses to memory device addresses that effectively minimizes simultaneous accesses to the same bank of memory to avoid access delays.
2. Background of the Invention
Superscalar processors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. On the other hand, superpipelined processor designs divide instruction execution into a large number of subtasks which can be performed quickly, and assign pipeline stages to each subtask. By overlapping the execution of many instructions within the pipeline, superpipelined processors attempt to achieve high performance.
Superscalar processors demand low main memory latency due to the number of instructions attempting concurrent execution and due to the increasing clock frequency (i.e., shortened clock cycle) employed by the processors. Many of the instructions include memory operations to fetch (“read”) and update (“write”) memory operands. The memory operands must be fetched from or conveyed to main memory, and each instruction must originally be fetched from main memory as well. Similarly, processors that are superpipelined demand low main memory latency because of the high clock frequency employed by these processors and the attempt to begin execution of a new instruction each clock cycle. It is noted that a given processor design may employ both superscalar and superpipelined techniques in an attempt to achieve the highest possible performance characteristics.
Processors are often configured into computer systems that have a relatively large and slow main memory. Typically, multiple random access memory (“RAM”) modules comprise the main memory system. The RAM modules may be Single Inline Memory Modules (“SIMM”), Double Inline Memory Modules (“DIMM”), or RAMbus™ Inline Memory Modules (“RIMM”) that incorporate a number of RAM devices (see “RAMBUS Preliminary Information Direct RDRAM™”, Document DL0060 Version 1.01; “Direct Rambus™ RIMM™ Module Specification Version 1.0”, Document SL-0006-100; “Rambus© RIMM™ Module (with 128/144 Mb RDRAMs)”, Document DL00084 Version 1.1, all of which are incorporated by reference herein). RAM devices may be Dynamic Random Access Memory (“DRAM”) devices, RAMbus™ DRAM (“RDRAM”) or any of a number of other types of memory storage devices. Each RAM device consists of a DRAM core section containing memory banks organized into rows and columns, with each column containing a number of bytes (in the preferred embodiment 16 bytes). A large main memory provides storage for a large number of instructions and/or a large amount of data for use by the processor, providing faster access to the instructions and/or data than may be achieved, for example, from disk storage. However, the access times of modern RAMs are significantly longer than the clock cycle length of modern processors. The memory access time for each set of bytes being transferred to the processor is therefore long. Accordingly, the main memory system is not a low latency system. Processor performance may suffer due to high memory latency.
Many types of RAMs employ a “page mode” which allows for memory latency to be decreased for transfers within the same “page”. Generally, as explained above, RAMs comprise memory arranged into rows and columns of storage. A first portion of the address identifying the desired data/instructions is used to select one of the rows (the “row address”), and a second portion of the address is used to select one of the columns (the “column address”). One or more bytes residing at the selected row and columns are provided as output of the RAM. Typically, the row address is provided to the RAM first, and the selected row is placed into a temporary sense amplifier buffer within the RAM. The row of data that is stored in the RAM's sense amplifier is referred to as a page. Thus, addresses having the same row address are said to be in the same page. Subsequent to the selected row being placed into the sense amplifier buffer, the column address is provided and the selected data is output from the RAM. A row/page hit occurs if the next address to access the RAM is within the same row/page stored in the sense amplifier buffer. Thus, the next access may be performed by providing the column portion of the address only, omitting the row address transmission. The next access to a different column may therefore be performed with lower latency, saving the time required for transmitting the row address because the page corresponding to the row has already been activated. The size of a row/page is dependent upon the number of columns within the row/page. The row/page stored in the sense amplifier within the RAM is referred to as an “open page”, since accesses within the open page can be performed by transmitting the column portion of the address only.
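The row/page hit behavior described above can be illustrated with a minimal sketch. The class name `Bank`, the 6-bit column field, and the sample addresses are hypothetical choices made for illustration only; they are not taken from any particular RAM device, and timing is not modeled.

```python
class Bank:
    """Toy model of one memory bank with a single sense amplifier buffer."""

    def __init__(self, column_bits):
        self.column_bits = column_bits  # assumed width of the column address
        self.open_row = None            # no row/page is open initially

    def access(self, address):
        """Return ("hit" or "miss", row, column) for one access."""
        row = address >> self.column_bits
        column = address & ((1 << self.column_bits) - 1)
        if row == self.open_row:
            # Row/page hit: only the column address would be transmitted.
            return ("hit", row, column)
        # Row/page miss: the requested row replaces the open page.
        self.open_row = row
        return ("miss", row, column)


bank = Bank(column_bits=6)
results = [bank.access(a)[0] for a in (0x000, 0x004, 0x03F, 0x040, 0x041)]
print(results)
```

In this sketch the first three addresses share row 0, so only the first of them misses; address 0x040 falls in row 1 and forces a new page to be opened.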
Unfortunately, the first access to a given row/page generally does not occur to an open row/page, thereby incurring a higher memory latency. Even further, the first access may experience a row/page miss. A row/page miss can occur if the sense amplifier has another particular row/page open, and the particular row/page must first be closed before opening the row/page containing the current access. A row/page miss can also occur if the sense amplifier is empty. Often, this first access is critical to maintaining performance in the processor within the computer system, as the data/instructions are immediately needed to satisfy a miss. Instruction execution may stall because of the row/page miss while the row/page containing the current access is being opened. The more often that instructions can access main memory using row/page hits, the lower the latency of memory access and the better the system performance. In a memory system containing many RAM devices and thus a large number of sense amplifier buffers, a large amount of memory can be accessed using row/page hits, resulting in an increased opportunity to maximize performance.
Software applications executing on the computer system frequently perform read or write operations that include a processor memory address mapped to a device address. The device address identifies a DRAM device, memory banks within the DRAM device, and rows and columns within each memory bank. The mapping of the processor memory address to the device address selects the DRAM device and row and column and manages memory bank conflicts. Memory bank conflicts are caused by attempts to perform a read or write to a memory bank within a DRAM device while another read or write is occurring to the same memory bank. Memory bank conflicts degrade memory system performance because memory transactions must be delayed while a previous memory transaction completes within the DRAM device. Thus, to increase system performance the mapping strategy implemented must reduce memory bank conflicts. Because memory configurations can vary widely in the number of DRAM devices present as well as the organization of the DRAM devices (i.e., number of memory banks, interface logic operation), it is highly desirable to permit a system programmer to program the mapping scheme for each particular configuration and software application to allow maximum system performance.
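As a hedged illustration of a programmable mapping, the sketch below splits a processor address into device, bank, row, and column fields according to a layout table that a system programmer could change. The function names, the field positions, and the field widths are assumptions chosen for the example, not values from any actual memory configuration.

```python
from collections import namedtuple

DeviceAddress = namedtuple("DeviceAddress", "device bank row column")


def extract(address, low_bit, width):
    """Return the `width`-bit field of `address` that starts at `low_bit`."""
    return (address >> low_bit) & ((1 << width) - 1)


def map_address(address, layout):
    """Map a processor address to a device address.

    `layout` maps each field name to a (low_bit, width) pair, so the
    mapping can be reprogrammed without changing this function.
    """
    fields = {name: extract(address, lo, w) for name, (lo, w) in layout.items()}
    return DeviceAddress(**fields)


# Hypothetical layout: 6 column bits, 4 bank bits, 9 row bits, 2 device bits.
layout = {"column": (0, 6), "bank": (6, 4), "row": (10, 9), "device": (19, 2)}

address = (1 << 19) | (5 << 10) | (3 << 6) | 7
mapped = map_address(address, layout)
print(mapped)
```

Changing only the `layout` table moves a field to different address bits, which is the kind of per-configuration flexibility the paragraph above calls for.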
The mapping of processor memory addresses to device addresses for optimal performance must take into account read and write traffic patterns on main memory. One property of read/write memory traffic is referred to as locality of reference. Locality of reference means that if a memory address “A” is accessed, then it is likely that the next address “B” is near or adjacent to “A.” An address-mapping scheme should not result in memory bank conflicts from successive accesses to contiguous addresses in main memory. For example, assume that a software application is performing reads and writes to a large contiguous area of main memory that spans row/page boundaries. As long as memory is being accessed from the same row/page in the sense amplifier, no row/page misses occur and thus the page in the sense amplifier does not have to be replaced with a different row/page. However, when the end of the row/page in the sense amplifier is reached and the next row/page is required, a row/page close cycle is needed to store the old row/page and a row/page open cycle is required to open the new row/page. If the processor memory address to device address mapping scheme is such that the next required row containing the new page is in the same memory bank or an adjacent memory bank (for DRAM devices in which memory banks share sense amplifiers) as the row containing the previous page, opening the next row/page to perform reads and writes will be delayed while the closure of the previous row/page completes. It would be advantageous if successive reads and writes to contiguous rows/pages of memory resulted in accesses to different nonadjacent memory banks of the DRAM device.
Another common read/write traffic pattern occurs in processors that include cache memories. Processors use cache memory in memory systems to improve computer system performance. A cache memory holds a subset of the contents of main memory and is faster and smaller than main memory. An architecture common in the art provides a level one (“L1”) cache on the same integrated circuit as the microprocessor and a level two (“L2”) cache either on the same integrated circuit as the microprocessor or on the system board of the computer. The smallest unit of memory that can be loaded into a cache memory is known as a cache block. A set associative cache is divided up into sets with each set containing two or more block frames that store blocks of data from main memory. A block of data from main memory is first mapped into a set of the cache and then it can be placed anywhere within the set. The cache placement is called n-way set associative if there are n block frames in a set.
Read/write memory transactions in the computer system may result in the cache memory becoming full. A read or write request to a memory block not present in the cache would then result in the replacement of an existing memory block present in a set of the cache memory. If the cache memory is a writeback set associative cache, the new read or write requests can result in the replacement of modified data in a cache block that must be written back to main memory. Each processor address in a block of data from main memory is mapped to a cache address that includes an index subfield identifying the particular set in the cache that the data block would be placed into. Thus, the addresses of blocks of data in the block frames of a particular set in the cache have the same index subfield and other blocks of data in main memory may also have the same index subfield. A processor address to device address mapping scheme should advantageously seek to prevent memory bank conflicts from occurring by mapping the portion of the address that is not equal (i.e., fields other than the index subfield) in such a manner that the mapped memory banks selected are different. Despite the apparent performance advantages of such a mapping scheme, to date no such system allowing flexibility to maximize performance over all memory hardware configurations has been implemented.
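The shared index subfield can be made concrete with a short sketch. The 64-byte block size, the 1024-set cache, and the helper names `split` and `make_address` are assumptions for illustration; the "tag" here simply denotes the address bits above the index subfield.

```python
BLOCK_BITS = 6    # assumed 64-byte cache blocks
INDEX_BITS = 10   # assumed 1024 sets


def split(address):
    """Split a processor address into (tag, index, offset)."""
    offset = address & ((1 << BLOCK_BITS) - 1)
    index = (address >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset


def make_address(tag, index, offset=0):
    """Build a processor address from its three subfields."""
    return (tag << (BLOCK_BITS + INDEX_BITS)) | (index << BLOCK_BITS) | offset


# A modified block being evicted and the block replacing it occupy the
# same set, so their addresses agree in the index subfield and can only
# differ in the bits above it.
victim = make_address(tag=1, index=141)
fill = make_address(tag=5, index=141)
print(split(victim), split(fill))
```

Because the writeback of the victim and the read of the new block are issued close together, any bank selection derived only from the index subfield would send both to the same memory bank.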
The problems noted above are solved in large part by the systems and techniques of the preferred embodiment of the present invention, which avoids delays resulting from memory bank conflicts. Preferably, a computer system contains a processor that includes a software programmable memory mapper. The memory mapper maps an address generated by the processor into a device address for accessing physical main memory. The processor also includes a cache controller that maps the processor address into a cache address. The cache address places a block of data from main memory into a memory cache using an index subfield. The physical main memory contains RDRAM devices, each of the RDRAM devices containing a number of memory banks that store rows and columns of data. The memory mapper maps processor addresses to device addresses to increase memory system performance. The mapping minimizes memory access conflicts between the memory banks.
Conflicts between memory banks are reduced by placing a number of bits corresponding to the bank subfield above the most significant boundary bit of the index subfield. This diminishes the likelihood of page misses resulting from the replacement of data blocks in the cache memory because the read of the new data block and write of the victim data block are not to the same memory bank.
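A minimal sketch of this placement follows, comparing bank bits drawn from inside the index subfield with bank bits drawn from just above it. The field widths and function names are hypothetical, and the two schemes are simplified stand-ins rather than the mapping of the preferred embodiment.

```python
BLOCK_BITS = 6                        # assumed 64-byte cache blocks
INDEX_BITS = 10                       # assumed 1024 sets
BANK_BITS = 2                         # assumed number of bank bits considered
INDEX_TOP = BLOCK_BITS + INDEX_BITS   # first bit above the index subfield


def make_address(tag, index, offset=0):
    """Build a processor address from tag, index, and offset subfields."""
    return (tag << INDEX_TOP) | (index << BLOCK_BITS) | offset


def bank_inside_index(address):
    """Bank bits taken from the top of the index subfield."""
    return (address >> (INDEX_TOP - BANK_BITS)) & ((1 << BANK_BITS) - 1)


def bank_above_index(address):
    """Bank bits taken from just above the index subfield."""
    return (address >> INDEX_TOP) & ((1 << BANK_BITS) - 1)


victim = make_address(tag=1, index=141)   # modified block being written back
fill = make_address(tag=2, index=141)     # new block being read

print(bank_inside_index(victim), bank_inside_index(fill))
print(bank_above_index(victim), bank_above_index(fill))
```

With the bank bits inside the index subfield, the victim and the new block always select the same bank. With the bank bits above it, they select different banks whenever their addresses differ in those bits, as they do in this example.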
Adjacent memory bank conflicts are reduced for sequential accesses to memory banks by reversing the bit order of a bank number subfield within the bank subfield of the device address.
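The effect of reversing the bit order can be sketched as follows. The 3-bit bank number and the function name `reverse_bits` are assumptions made for the example; the actual subfield width depends on the memory configuration.

```python
def reverse_bits(value, width):
    """Return `value` with the order of its low `width` bits reversed."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # move the lowest bit to the top
        value >>= 1
    return result


WIDTH = 3  # assumed width of the bank number subfield (8 banks)

# Bank numbers produced by sequential accesses, before and after reversal.
sequential = list(range(1 << WIDTH))
reversed_banks = [reverse_bits(b, WIDTH) for b in sequential]
print(reversed_banks)

# Distance between the banks selected by each pair of successive accesses.
gaps = [abs(a - b) for a, b in zip(reversed_banks, reversed_banks[1:])]
print(gaps)
```

For this 3-bit case, successive accesses that would otherwise select neighboring banks are separated by at least two bank numbers, so no two successive accesses land in adjacent banks. With only two bank bits the reversed sequence still contains one adjacent pair, so the benefit depends on the subfield width.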