The present invention relates to computer memory systems. More specifically, the present invention relates to routing address signals to memory banks in a computer system to achieve various memory interleaving strategies.
In the art of computing, it is common to store program instructions and data in dynamic random access memory (DRAM). The most common type of DRAM memory cell is a single transistor coupled to a small capacitor. A data bit is represented in the memory cell by the presence or absence of charge on the capacitor. The cells are organized into an array of rows and columns.
FIG. 1 is a block diagram of a typical prior art memory chip 10 that is based on a 4 megabit memory array 12 having 2,048 rows and 2,048 columns. Memory chip 10 has a 4 bit wide data input/output path. Row demultiplexer 15 receives an 11 bit row address and generates row select signals that are provided to memory array 12. Page buffer 14 acts as a temporary storage buffer for rows of data from array 12. Column multiplexer 16 receives a 9 bit column address and multiplexes the 4 bit data input/output path to a selected portion of buffer 14.
The distinction between rows and columns is significant because of the way a memory access proceeds. Page buffer 14 is formed from a single row of cells. The cells act as a temporary staging area for both reads and writes. A typical DRAM access consists of a row access cycle, one or more column access cycles, and a precharge cycle. The precharge cycle will be described in greater detail below.
The row access cycle (also called a page opening) is performed by presenting the row address bits to row demultiplexer 15 to select a row. The entire contents of that row are then transferred into page buffer 14. This transfer is done in parallel, and it empties all memory cells in that row of their contents. The transfer is done by driving whatever charge exists in each row capacitor down to a set of amplifiers that load page buffer 14. This operation also erases the contents of the capacitors of the row that is accessed. For typical prior art DRAMs, this operation takes approximately 30 ns.
Next, the column access cycle is performed by presenting the column address bits to select a particular column or set of columns, and the data is either read from or written to page buffer 14. During the column access cycle, page buffer 14 acts as a small RAM. The typical access delay for this operation is approximately 30 ns to receive the first 4 bits of data, and 10 ns to receive subsequent 4 bit chunks of data. Several consecutive accesses can be made to the page to access different columns, thereby allowing the entire row to be written to or read from very quickly. For a typical four bit wide DRAM such as that shown in FIG. 1, a page of 2,048 bits (or 256 bytes) can be read out in 512 accesses, or 5.14 μs. Accordingly, the bandwidth of DRAM chip 10 is 49.8 megabytes per second. It is easy to see how a few DRAM chips in parallel can yield very high bandwidth.
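The readout time and bandwidth quoted above can be checked with a short calculation. The following is a minimal sketch using only the figures given in the text; the function and constant names are illustrative and are not part of the disclosure.

```python
# Timing figures quoted in the text for the 4-bit-wide chip of FIG. 1.
FIRST_COLUMN_ACCESS_NS = 30   # delay to receive the first 4 bits
NEXT_COLUMN_ACCESS_NS = 10    # delay for each subsequent 4-bit chunk
PAGE_BITS = 2048              # one row of the 2,048 x 2,048 array
DATA_WIDTH_BITS = 4           # width of the data input/output path


def page_read_time_ns(page_bits=PAGE_BITS, width_bits=DATA_WIDTH_BITS):
    """Time to stream a whole open page through the column multiplexer."""
    accesses = page_bits // width_bits
    return FIRST_COLUMN_ACCESS_NS + (accesses - 1) * NEXT_COLUMN_ACCESS_NS


def page_bandwidth_mb_per_s(page_bits=PAGE_BITS, width_bits=DATA_WIDTH_BITS):
    """Bytes per nanosecond scaled to megabytes (10**6 bytes) per second."""
    page_bytes = page_bits // 8
    return page_bytes / page_read_time_ns(page_bits, width_bits) * 1000.0


if __name__ == "__main__":
    print(page_read_time_ns(), "ns")
    print(round(page_bandwidth_mb_per_s(), 1), "MB/s")
```

The 512 accesses consist of one 30 ns access followed by 511 accesses of 10 ns each, which is where the 5.14 μs figure comes from.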
The final cycle of the memory access is the precharge cycle, which is also known in the art as page closing. As discussed above, the row access cycle destroyed the contents of the capacitors of the row that was read into buffer 14. Before another row can be read into buffer 14, the contents in page buffer 14 must be transferred back to memory array 12. This process is called the precharge cycle. In most prior art DRAM chips, no address is required because the address of the open row is latched when the contents of that row are transferred into buffer 14, and that address is retained as long as the page is open. Typically, the precharge cycle lasts about 40 ns.
In addition to the normal read and write access cycles, most DRAMs also require refresh cycles. The small capacitors that make up each memory cell suffer from leakage, and after a short period of time, the charge will drain away. To prevent the loss of data, each row must be precharged (opened and closed) at a certain minimum rate. The size of the capacitors and leakage allowed is balanced with the size of the array in such a way that the number of refresh cycles required is a small fraction of the total bandwidth of the DRAM. Typically, DRAMs are engineered such that refreshing the rows at a rate of one row per 15.6 microseconds is sufficient to maintain the data. Accordingly, while data can be written to and read from page buffer 14 many consecutive times, buffer 14 cannot be held open indefinitely because it must be periodically closed to allow other rows to be refreshed.
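As a rough illustration of why refresh consumes only a small fraction of the available bandwidth, the sketch below combines the 15.6 microsecond refresh interval with the 2,048 row array of FIG. 1. It assumes, for illustration only, that refreshing one row costs one row access cycle plus one precharge cycle (30 ns + 40 ns); the text does not state the cost of a refresh cycle.

```python
ROWS = 2048                    # rows in memory array 12 of FIG. 1
REFRESH_INTERVAL_NS = 15_600   # one row refreshed every 15.6 microseconds
ROW_ACCESS_NS = 30             # page opening time quoted in the text
PRECHARGE_NS = 40              # page closing time quoted in the text


def full_refresh_period_ms(rows=ROWS, interval_ns=REFRESH_INTERVAL_NS):
    """Time needed to visit every row once at the stated refresh rate."""
    return rows * interval_ns / 1_000_000


def refresh_overhead_fraction(interval_ns=REFRESH_INTERVAL_NS):
    """Fraction of time spent refreshing (ASSUMPTION: open + close per row)."""
    return (ROW_ACCESS_NS + PRECHARGE_NS) / interval_ns


if __name__ == "__main__":
    print(full_refresh_period_ms(), "ms per full refresh pass")
    print(f"{refresh_overhead_fraction():.2%} of time spent on refresh")
```

Under these assumptions every row is revisited roughly every 32 ms, and refresh occupies well under one percent of the chip's time.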
There are two primary types of DRAMs known in the art, asynchronous DRAMs and synchronous DRAMs. Asynchronous DRAMs do not have a clock input. Rather, complex timing constraints among various signals and addresses must be satisfied in order for the DRAM to operate properly. The two main control pins for asynchronous DRAMs are “row address strobe” (RAS) and “column address strobe” (CAS). To open a row, RAS is asserted (typically, lowered). To close a row, RAS is deasserted. To access a column, CAS is asserted, and to access another column, CAS must be deasserted and then reasserted. Note that CAS can be asserted and deasserted multiple times while RAS is asserted.
In contrast to asynchronous DRAMs, synchronous DRAMs (SDRAMs) accept a clock input, and almost all timing delays are specified with respect to this clock. In addition, SDRAMs usually have between two and eight different logical arrays of memory (or banks) that can operate independently. Rather than use separate RAS and CAS signals for each bank, a sequence of commands is sent to the DRAM synchronously to perform page opening, column access, and page closing functions. Additional address bits are used for bank selection. One major benefit provided by SDRAMs is pipelining. While one bank is being accessed, another bank can be refreshed or precharged in the background.
Despite these differences, SDRAM organization is very similar to asynchronous DRAM organization. In fact, many memory controllers for asynchronous DRAMs support multiple banks and background refreshing and precharging operations.
In the prior art, the term “bank” was traditionally used to denote a group of asynchronous DRAM chips that were accessed in parallel. Accordingly, a bank was accessed by generating a bank select signal, along with appropriate row and column addresses, as described above. However, a single SDRAM chip has multiple banks. Therefore, the term “rank” is used to denote a group of SDRAM chips that are accessed in parallel, and additional bank bits are routed to the SDRAM rank. In a system capable of supporting either SDRAMs or asynchronous DRAMs, typically the higher order bank bits that are used when accessing asynchronous DRAMs are used as rank bits when accessing SDRAMs, and the lower order bank bits that are used when accessing asynchronous DRAMs are routed to the SDRAMs. It should be noted that each bank within an SDRAM rank has its own set of page buffers.
DRAM chips can be organized to form main memory systems in a variety of ways. Typically the width and speed of the system bus are synchronized to the width and speed of the main memory system bus by providing the main memory system bus with the same bandwidth as the system bus. Usually system busses are both faster and wider than the data I/O interface provided by individual DRAM chips, so multiple DRAM chips are arranged in parallel to match the bandwidth of the system bus. If a particular computer system has a 16 byte wide data bus that operates at 66 MHz, then a main memory subsystem of the computer system that operates at 33 MHz and is constructed with 4-bit wide DRAM chips will typically have 64 DRAM chips arranged in each bank, thereby providing each bank with a bandwidth of nearly a gigabyte per second, which matches the bandwidth of the system data bus. If the bandwidths are not matched, other techniques may be employed, such as using a small FIFO to buffer memory accesses and blocking memory accesses when the FIFO is full.
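The bandwidth matching in this example can be worked through as follows. This is a minimal sketch; the helper names are illustrative, and the clock rates are treated as the nominal 66 MHz and 33 MHz values given in the text.

```python
def bandwidth_mb_per_s(width_bits, clock_mhz):
    """Peak bandwidth of a bus that moves width_bits on every clock."""
    return width_bits / 8 * clock_mhz


def chips_per_bank(bus_width_bytes, bus_mhz, chip_width_bits, mem_mhz):
    """Number of parallel DRAM chips needed to match the system bus."""
    bus_bits_per_us = bus_width_bytes * 8 * bus_mhz
    chip_bits_per_us = chip_width_bits * mem_mhz
    # Ceiling division, so a partial chip rounds up to a whole one.
    return -(-bus_bits_per_us // chip_bits_per_us)


if __name__ == "__main__":
    chips = chips_per_bank(16, 66, 4, 33)
    print(chips, "chips per bank")
    print(bandwidth_mb_per_s(16 * 8, 66), "MB/s on the system bus")
    print(bandwidth_mb_per_s(chips * 4, 33), "MB/s from one bank")
```

With 64 chips of 4 bits each, the bank is 32 bytes wide at 33 MHz, which gives the same 1,056 megabytes per second as the 16 byte bus at 66 MHz.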
Consider the page buffers of the DRAM chips that form a single bank. All the individual page buffers are accessed in parallel, thereby combining to form a larger “logical” page buffer. As shown in FIG. 1, each DRAM chip 10 has a 2,048 bit, or 256 byte, page buffer 14. If 32 chips are arranged in parallel, the logical page buffer is 8,192 bytes wide. If the low order address bits are used to index columns, two memory locations having addresses that differ only in the lower 13 bits of the logical memory address will be in the same row, and therefore will be available in a logical page buffer concurrently.
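The relationship between the number of chips, the logical page size, and the 13 low order address bits can be sketched as below. The function names are illustrative, and the same-page test assumes that the low order address bits index columns, as stated in the text.

```python
CHIP_PAGE_BYTES = 256   # 2,048-bit page buffer 14 of FIG. 1
CHIPS_IN_PARALLEL = 32  # chip count used in this example


def logical_page_bytes(chips=CHIPS_IN_PARALLEL, chip_page_bytes=CHIP_PAGE_BYTES):
    """Size of the combined page buffer formed by chips accessed in parallel."""
    return chips * chip_page_bytes


def page_offset_bits(page_bytes):
    """Number of low-order address bits that index within one logical page."""
    return (page_bytes - 1).bit_length()


def same_logical_page(addr_a, addr_b, offset_bits=13):
    """True when two addresses differ only in the page offset bits."""
    return (addr_a >> offset_bits) == (addr_b >> offset_bits)


if __name__ == "__main__":
    size = logical_page_bytes()
    print(size, "bytes,", page_offset_bits(size), "offset bits")
    print(same_logical_page(0x1000, 0x1FFF))  # both inside one 8 KB page
    print(same_logical_page(0x1FFF, 0x2000))  # straddles a page boundary
```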
Each bank of DRAM chips has its own set of page buffers. Therefore, a logical page buffer exists for each memory bank provided in the computer system. If the high order address bits are used to select banks, then there is an 8 kilobyte logical page buffer for the first 16 megabytes of physical memory, another 8 kilobyte logical page buffer for the next 16 megabytes of physical memory, and so on.
If the system described above employed SDRAMs having bank select bits, the internal banks of the SDRAMs may be viewed as collections of relatively independent banks of DRAMs, with the high order bank bits used as rank select bits and the low order bank bits routed to the SDRAMs. Accordingly, for the purpose of illustrating the present invention below, there is little difference between the memory banks that are derived from collections of chips addressed independently, and the memory banks within SDRAM chips, except that in the latter case some of the bank bits are routed to the SDRAM chips.
Consider a typical cache line read in the system described above. First, the appropriate bank is selected, and then a row is transferred into the logical page buffers. This takes approximately 30 ns. Next, 4 16-byte chunks are read from the logical page buffer; this takes approximately 60 ns (30 ns for the first 16 byte chunk, and 10 ns for each of the next three 16 byte chunks), and provides a complete cache line. Finally, the logical page buffer is closed; this takes 40 ns. The total time was 130 ns. The time before the first word was read was 60 ns (page open plus first column access). Many systems are configured such that the first word available is the first word required by the CPU. The time required to retrieve the first word is known in the art as the “critical word latency”.
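The timing of this cache line read can be tallied with a few lines of code. This is a minimal sketch built only from the figures above; the names are illustrative.

```python
ROW_ACCESS_NS = 30        # open the page
FIRST_COLUMN_NS = 30      # first 16-byte chunk
NEXT_COLUMN_NS = 10       # each following 16-byte chunk
PRECHARGE_NS = 40         # close the page
CHUNKS_PER_CACHE_LINE = 4


def column_phase_ns(chunks=CHUNKS_PER_CACHE_LINE):
    """Time spent reading all chunks of one cache line from an open page."""
    return FIRST_COLUMN_NS + (chunks - 1) * NEXT_COLUMN_NS


def total_access_ns(chunks=CHUNKS_PER_CACHE_LINE):
    """Open the page, read the cache line, then close the page."""
    return ROW_ACCESS_NS + column_phase_ns(chunks) + PRECHARGE_NS


def critical_word_latency_ns():
    """Delay until the first chunk is available when the page starts closed."""
    return ROW_ACCESS_NS + FIRST_COLUMN_NS


if __name__ == "__main__":
    print(column_phase_ns(), total_access_ns(), critical_word_latency_ns())
```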
It is common in the art for a memory controller to gamble that successive references to the same memory bank will access the same row (or page). Such a memory controller is known as a page mode memory controller. A page hit occurs when the memory controller processes a memory access request, and finds that the row that needs to be accessed is already in the logical page buffer. In a page mode memory controller, the page is not closed after an access. Instead, the page is only closed when an access to that bank requires a different page or a refresh cycle occurs.
If a subsequent memory access is indeed for the same page, then the critical word latency is shortened from 60 ns to just 30 ns (the first column access alone), a significant savings. If a subsequent memory access is not for the same page, then a penalty is incurred. The old page stored in the logical page buffer must undergo a precharge cycle before a new page can be opened, so the critical word latency is 40 ns (precharge) plus 30 ns (row access) plus 30 ns (first word available), or 100 ns, quite a bit more than the previous value of 60 ns that is achieved when the logical page buffer is precharged after every access.
If p is the probability that the next access is on the same page, then the average critical word latency is 30 ns*p + 100 ns*(1 − p) (or 100 ns − 70 ns*p). Note that the critical word latency decreases as p increases. The point at which the gamble pays off is when the average critical word latency is 60 ns, which, as described above, is the critical word latency achieved when the logical page buffer is closed after each memory access. Accordingly, the point at which it pays to keep the logical page buffer open after each access occurs when there is a greater than 0.571 probability that a sequential memory access will reference the same page.
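The break-even probability follows from setting the average latency equal to 60 ns and solving for p, which gives p = 40/70. The sketch below evaluates the same expression; the function names are illustrative.

```python
PAGE_HIT_LATENCY_NS = 30     # first column access only
PAGE_MISS_LATENCY_NS = 100   # precharge + row access + first column access
CLOSED_PAGE_LATENCY_NS = 60  # row access + first column access


def average_critical_word_latency_ns(p):
    """Expected critical word latency for page hit probability p."""
    return PAGE_HIT_LATENCY_NS * p + PAGE_MISS_LATENCY_NS * (1 - p)


def break_even_hit_probability():
    """Hit probability at which page mode matches closing after every access."""
    return (PAGE_MISS_LATENCY_NS - CLOSED_PAGE_LATENCY_NS) / (
        PAGE_MISS_LATENCY_NS - PAGE_HIT_LATENCY_NS
    )


if __name__ == "__main__":
    p = break_even_hit_probability()
    print(round(p, 3), average_critical_word_latency_ns(p))
```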
Assume that in a computer system having a page mode memory controller, requests are fed to the memory controller as fast as they can be consumed. Each time a page in a bank is accessed for the first time, a precharge cycle is required to close the old page and a row access cycle is required to open the new page, which together require 70 ns. As described above, each cache line access from an open page requires 60 ns. Thus, an average cache line access requires 60 ns + 70 ns*(1 − p). In contrast, as discussed above, a non-page mode memory controller requires 130 ns (30 ns row access, 60 ns of column accesses, and 40 ns precharge).
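The throughput comparison can be sketched in the same way. The non-page mode figure used here is the sum of the row access, the four column accesses, and the precharge from the cache line read described earlier (30 ns + 60 ns + 40 ns); the function names are illustrative.

```python
OPEN_PAGE_LINE_NS = 60      # four column accesses from an already open page
PAGE_SWITCH_NS = 40 + 30    # precharge the old page, then open the new one
NON_PAGE_MODE_LINE_NS = 30 + 60 + 40  # open, read, and close on every access


def page_mode_line_time_ns(p):
    """Average cache line access time for page hit probability p."""
    return OPEN_PAGE_LINE_NS + PAGE_SWITCH_NS * (1 - p)


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(p, page_mode_line_time_ns(p), "vs", NON_PAGE_MODE_LINE_NS)
```

Under these figures, a saturated page mode controller is never slower than the non-page mode controller, and it is faster whenever any accesses hit an open page.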
In the prior art, many page mode memory controllers simply mapped column bits to the least significant bits of the address, mapped row bits to the address bits immediately after the column bits, and then mapped bank select bits to the highest bits of the address. Given this configuration, assume that a large contiguous memory block that spans page boundaries must be accessed. As long as memory is being accessed from a single page buffer, no precharge cycles are required. However, when the end of the page is reached and the next page is required, a precharge cycle is required to store the old page and a row access cycle is required to access the new page. Since the row bits are arranged as described above, the next row required will be in the same bank as the previous row (unless the memory block spans a bank boundary).
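The conventional mapping can be illustrated with a small address decoder. The field widths below are assumptions taken from the example system described earlier (a 13 bit logical page offset and 11 row bits, giving 16 megabytes per bank); the function name is illustrative.

```python
COLUMN_BITS = 13  # ASSUMPTION: 8,192-byte logical page from the example system
ROW_BITS = 11     # ASSUMPTION: 2,048 rows per bank, as in FIG. 1


def decode_conventional(addr):
    """Split an address into (bank, row, column) with bank bits on top."""
    column = addr & ((1 << COLUMN_BITS) - 1)
    row = (addr >> COLUMN_BITS) & ((1 << ROW_BITS) - 1)
    bank = addr >> (COLUMN_BITS + ROW_BITS)
    return bank, row, column


if __name__ == "__main__":
    # Crossing a page boundary changes the row but not the bank, so the
    # precharge and row access cannot be overlapped with another bank.
    print(decode_conventional(0x1FFF))
    print(decode_conventional(0x2000))
    # Only at a 16-megabyte boundary does the bank change.
    print(decode_conventional(1 << 24))
```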
U.S. Pat. No. 5,051,889 to Fung et al. and entitled “Page Interleaved Memory Access” provides an improvement when accessing contiguous memory that spans page boundaries. Basically, Fung et al. swap the first bank select bit with the first row select bit, thereby causing even memory pages to be stored in a first bank, and odd memory pages to be stored in a second bank. Accordingly, when a series of sequential memory accesses to a contiguous segment of memory cross a page boundary, the memory accesses also cross a bank boundary, which allows the precharge cycle of the first bank to be overlapped with the row access cycle of the second bank. The system disclosed by Fung et al. also allows two contiguous pages to be open at once, thereby allowing a program with an active “hot spot” that spans two contiguous pages to achieve a higher page hit rate.
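The effect of swapping the first bank select bit with the first row select bit can be sketched as below. This illustrates the idea as summarized here and is not the circuit of Fung et al.; the bit positions are assumptions taken from the example system (13 column bits, 11 row bits), and the function names are illustrative.

```python
COLUMN_BITS = 13                       # ASSUMPTION: 8,192-byte logical page
ROW_BITS = 11                          # ASSUMPTION: 2,048 rows per bank
FIRST_ROW_BIT = COLUMN_BITS            # address bit 13
FIRST_BANK_BIT = COLUMN_BITS + ROW_BITS  # address bit 24


def swap_bits(value, i, j):
    """Exchange bits i and j of value."""
    if ((value >> i) & 1) != ((value >> j) & 1):
        value ^= (1 << i) | (1 << j)
    return value


def decode_page_interleaved(addr):
    """Decode (bank, row, column) after swapping the two address bits."""
    swapped = swap_bits(addr, FIRST_ROW_BIT, FIRST_BANK_BIT)
    column = swapped & ((1 << COLUMN_BITS) - 1)
    row = (swapped >> COLUMN_BITS) & ((1 << ROW_BITS) - 1)
    bank = swapped >> (COLUMN_BITS + ROW_BITS)
    return bank, row, column


if __name__ == "__main__":
    page_size = 1 << COLUMN_BITS
    # Consecutive pages alternate between bank 0 and bank 1.
    print([decode_page_interleaved(n * page_size)[0] for n in range(4)])
```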
A similar technique was proposed by Mike Bell and Tom Holman in a paper entitled “Pentium® Pro Workstation/Server PCI Chipset”, which was published in the Digest of Papers of the 41st IEEE Computer Society International Conference held Feb. 25-28, 1996. The technique proposed by Bell and Holman is called address bit permuting, and like the memory scheme disclosed by Fung et al., involves swapping bank bits and row bits.
While it is desirable to increase page hit rates, in a multi-processor system, it is also desirable to distribute memory accesses among different banks. One of the easiest ways to do this is to ensure that each processor distributes its accesses across different banks. One method known in the art that provides this feature is referred to as “cache line interleaving”. Basically, cache line interleaving routes one or more bank bits to the address bits immediately above a cache line. Therefore, one cache line is stored in a first bank, the next cache line is stored in a second bank, and so on. In non-page mode controllers, this allows row access cycles and precharge cycles to be overlapped as contiguous cache lines are accessed. It also ensures that each processor's memory accesses are evenly distributed across memory banks, and thereby ensures that multiple processors will not be continuously contending for the same bank. Of course, cache line interleaving seeks to distribute multiple accesses across many banks, while a page mode controller seeks to route multiple accesses to the same bank, so these techniques are in conflict.
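Cache line interleaving can be sketched as taking the bank number from the address bits immediately above the cache line offset. The 64 byte cache line follows from the four 16 byte chunks in the earlier example; the number of bank bits (two, for four banks) is an assumption chosen for illustration.

```python
CACHE_LINE_BITS = 6  # 64-byte cache line (four 16-byte chunks)
BANK_BITS = 2        # ASSUMPTION: four banks, for illustration only


def bank_for_cache_line(addr, bank_bits=BANK_BITS):
    """Bank selected by the address bits just above the cache line offset."""
    return (addr >> CACHE_LINE_BITS) & ((1 << bank_bits) - 1)


if __name__ == "__main__":
    line = 1 << CACHE_LINE_BITS
    # Contiguous cache lines rotate through the banks.
    print([bank_for_cache_line(n * line) for n in range(5)])
```

Because every successive cache line lands in a different bank, a contiguous stream rarely revisits an open page, which is the conflict with page mode operation noted above.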
In a typical computer system, memory is usually provided by single in-line memory modules (SIMMs) and/or dual in-line memory modules (DIMMs). The DIMMs and SIMMs are typically constructed using asynchronous DRAM chips or SDRAM chips. Usually a computer system will have a series of SIMM and/or DIMM sockets that accept the memory modules. Since SIMMs and DIMMs come in a variety of configurations, are constructed from different types of chips, and all sockets need not be populated, a memory controller of the computer system must have the ability to route address bits to various rank, bank, row, and column bits. Providing page interleaving greatly complicates this routing.
The present invention provides a method and apparatus for determining interleaving schemes in a computer system that supports multiple interleaving schemes. In one embodiment, a memory interleaving scheme lookup table is used to assign memory interleaving schemes based on the number of available bank bits.
Another embodiment of the present invention is based on the realization that the percentage of concurrent memory operations may be increased by assigning memory interleaving schemes to bank bits based on the classification of bank bits. Consider a memory controller that provides separate memory busses that support independent simultaneous memory transactions, with each bus coupled to a memory buffer/multiplexer unit that provides memory bus segments that allow memory read operations to be overlapped with memory write operations, with each memory bus segment capable of carrying a single memory operation at any given time. Bank bits that distinguish between memory busses are classified as class A, bank bits that distinguish between memory bus segments are classified as class B, and bank bits that distinguish between banks on a memory bus segment are classified as class C.
Assume that the memory controller supports multi-cache line interleaving, cache effect interleaving, and DRAM page interleaving. Multi-cache line interleaving attempts to distribute memory hot spots across several banks so that multiple CPUs tend not to access the same memory bank at the same time. The memory access patterns associated with multi-cache line interleaving will tend to be independent and unrelated. Multiple writes may occur simultaneously, multiple reads may occur simultaneously, reads and writes may occur simultaneously, and so on. Therefore, class A bank bits are optimally allocated to multi-cache line interleaving.
A dirty cache line is a cache line that contains memory contents which have been altered by the processor. Therefore the cache line contents must be written back to main memory before the cache line can be replaced. Cache effect interleaving allows a dirty cache line that is cast out from a set of a cache to be written to a different DRAM page than a cache line being read into the same set. Therefore, typically read and write operations will occur in pairs, as one cache line is read into the cache from one bank while another cache line is cast out from the cache and written to another bank. Therefore, class B bank bits are optimally allocated to cache effect interleaving. Class A bank bits could also be optimally allocated to cache effect interleaving, but for the reasons discussed above, it is better to reserve class A bank bits for multi-cache line interleaving, especially in a multi-processor system.
DRAM page interleaving causes contiguous (or proximate) DRAM pages to be stored in separate banks, thereby allowing a program to have a memory hot spot that remains open in more than one bank. The memory access patterns associated with DRAM page interleaving tend to be serial in nature, and tend to be of the same type. For example, when program code is loaded into the cache, the program code will be loaded sequentially and most of the memory operations will be memory read operations. Similarly, when a program writes a block of data back to memory, the block of data will first be written to the cache, and the corresponding cache lines where the data is stored will all be dirty. If the cache lines that are replaced were also “dirtied” in a similar manner, then the cache lines that are cast out from the cache will tend to be serial and most of the operations will be memory write operations. Therefore, class C bank bits are optimally allocated to DRAM page interleaving. Class A and B bank bits could also be optimally allocated to DRAM page interleaving, but for the reasons discussed above, it is better to reserve class A bank bits for multi-cache line interleaving and class B bank bits for cache effect interleaving.
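The preferences stated in the preceding paragraphs can be summarized as a simple mapping from bank bit class to interleaving scheme. The sketch below is a hypothetical illustration of that preference order only: the table contents, names, and list-based input are assumptions, and it does not handle the fallback cases mentioned above in which class A or class B bits are given to another scheme.

```python
# Preferred scheme for each bank bit class, as described in the text.
PREFERRED_SCHEME = {
    "A": "multi-cache line interleaving",  # bits that select a memory bus
    "B": "cache effect interleaving",      # bits that select a bus segment
    "C": "DRAM page interleaving",         # bits that select a bank on a segment
}


def allocate_schemes(bank_bit_classes):
    """Pair each available bank bit with the scheme preferred for its class.

    bank_bit_classes is a hypothetical list such as ["A", "B", "C", "C"],
    with one entry per available bank bit.
    """
    return [
        (index, bit_class, PREFERRED_SCHEME[bit_class])
        for index, bit_class in enumerate(bank_bit_classes)
    ]


if __name__ == "__main__":
    for entry in allocate_schemes(["A", "B", "C", "C"]):
        print(entry)
```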
In accordance with an embodiment of the present invention, memory interleaving schemes are assigned to bank bits based on the classification of the bank bits using a memory interleaving scheme lookup table. In another embodiment, memory interleaving schemes are assigned to bank bits based on the classification of the bank bits using an algorithm.
The present invention provides a convenient, easy-to-configure method of allocating interleaving schemes to bank bits. The number of bank bits assigned to each interleaving scheme affects the page hit rate. In addition, the present invention allows the percentage of concurrent memory transactions to be increased by allocating bank bits to interleaving schemes based on the classification of the bank bits.