The present invention relates generally to computer memory systems and more specifically to configuring memory systems to optimize the use of page mode to improve the performance of computer memory systems.
In the art of computing, it is common to store program instructions and data in dynamic random access memory (DRAM). The most common type of DRAM memory cell is a single transistor coupled to a small capacitor. A data bit is represented in the memory cell by the presence or absence of charge on the capacitor. The cells are organized into an array of rows and columns.
FIG. 1 (PRIOR ART) is a block diagram of a typical prior art memory chip 10 that is based on a 4 megabit memory array 12 having 2,048 rows and 2,048 columns. Memory chip 10 has a 4-bit wide data input/output path. Row demultiplexor 15 receives an 11-bit row address and generates row select signals that are provided to memory array 12. Page buffer 14 acts as a temporary storage buffer for rows of data from array 12. Column multiplexor 16 receives a 9-bit column address and multiplexes the 4-bit data input/output path to a selected portion of buffer 14.
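By way of illustration only, the address widths of memory chip 10 follow directly from its geometry. The short Python sketch below derives them; the constant and function names are hypothetical and are not taken from any figure.

```python
ROWS = 2048        # rows in memory array 12
COLUMNS = 2048     # columns in memory array 12
DATA_WIDTH = 4     # bits on the data input/output path


def address_bits(rows=ROWS, columns=COLUMNS, data_width=DATA_WIDTH):
    """Return (row_bits, column_bits, capacity_bits) for a DRAM array."""
    # One row address per row; one column address per data_width-bit group.
    row_bits = (rows - 1).bit_length()
    column_bits = (columns // data_width - 1).bit_length()
    return row_bits, column_bits, rows * columns


row_bits, column_bits, capacity = address_bits()
print(row_bits, column_bits, capacity // 2**20)  # 11 9 4
```

The 2,048 columns are selected four at a time, which is why nine column address bits suffice.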
The distinction between rows and columns is significant because of the way a memory access proceeds. Page buffer 14 is formed from a single row of cells. The cells act as a temporary staging area for both reads and writes. A typical DRAM access consists of a row access cycle, one or more column access cycles, and a precharge cycle. The precharge cycle will be described in greater detail below.
The row access cycle (also called a page opening) is performed by presenting the row address bits to row demultiplexor 15 to select a row. The entire contents of that row are then transferred in parallel into page buffer 14 by driving whatever charge exists on each capacitor of the row down to a set of amplifiers that load page buffer 14. The transfer is destructive: it drains the capacitors of the accessed row and thereby erases their contents. For typical prior art DRAMs, this operation takes approximately 30 ns.
Next, the column access cycle is performed by presenting the column address bits to select a particular column or set of columns, and the data is either read from or written to page buffer 14. During the column access cycle, page buffer 14 acts as a small RAM. The typical access delay for this operation is approximately 30 ns to receive the first 4 bits of data, and 10 ns to receive each subsequent 4-bit chunk of data. Several consecutive accesses can be made to the page to access different columns, thereby allowing the entire row to be written to or read from very quickly. For a typical 4-bit wide DRAM such as that shown in FIG. 1 (PRIOR ART), a page of 2,048 bits (or 256 bytes) can be read out in 512 accesses, or 5.14 μs. Accordingly, the bandwidth of DRAM chip 10 is about 50 megabytes per second. It is easy to see how a few DRAM chips in parallel can yield very high bandwidth.
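The page read-out time and the resulting chip bandwidth can be checked with a short calculation. The following Python sketch uses only the timing figures given above; the names are hypothetical, and megabytes are taken as decimal (10^6 bytes).

```python
FIRST_ACCESS_NS = 30   # delay to the first 4 bits of an open page
NEXT_ACCESS_NS = 10    # delay for each subsequent 4-bit chunk
PAGE_BITS = 2048       # size of page buffer 14
DATA_WIDTH_BITS = 4    # width of the data input/output path


def page_read(page_bits=PAGE_BITS, width=DATA_WIDTH_BITS):
    """Return (accesses, total_ns, megabytes_per_second) for one full page."""
    accesses = page_bits // width
    total_ns = FIRST_ACCESS_NS + (accesses - 1) * NEXT_ACCESS_NS
    page_bytes = page_bits // 8
    # bytes per nanosecond * 1000 = decimal megabytes per second
    mb_per_s = page_bytes / total_ns * 1000
    return accesses, total_ns, mb_per_s


accesses, total_ns, mb_per_s = page_read()
print(accesses, total_ns, round(mb_per_s, 1))  # 512 5140 49.8
```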
The final cycle of the memory access is the precharge cycle, which is also known in the art as page closing. As discussed above, the row access cycle destroyed the contents of the capacitors of the row that was read into buffer 14. Before another row can be read into buffer 14, the contents in page buffer 14 must be transferred back to memory array 12. This process is called the precharge cycle. In most prior art DRAM chips, no address is required because the address of the open row is latched when the contents of that row are transferred into buffer 14, and that address is retained as long as the page is open. Typically, the precharge cycle lasts about 40 ns.
In addition to the normal read and write access cycles, most DRAMs also require refresh cycles. The small capacitors that make up each memory cell suffer from leakage, and after a short period of time, the charge will drain away. To prevent the loss of data, each row must be refreshed (opened and closed) at a certain minimum rate. The size of the capacitors and the leakage allowed are balanced with the size of the array in such a way that the refresh cycles required consume only a small fraction of the total bandwidth of the DRAM. Typically, DRAMs are engineered such that refreshing the rows at a rate of one row per 60 microseconds is sufficient to maintain the data. This refresh cycle requires the page buffer to store the row being refreshed. Thus, while data can be written to and read from page buffer 14 many consecutive times, buffer 14 cannot be held open indefinitely because it must be periodically closed to allow other rows to be refreshed.
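A rough estimate shows why refresh consumes only a small fraction of the available bandwidth. The Python sketch below assumes, for illustration only, that refreshing one row costs one row access (30 ns) plus one precharge (40 ns) and that the 2,048 rows of FIG. 1 (PRIOR ART) are refreshed in turn; neither assumption is stated as a requirement above.

```python
ROWS = 2048                    # rows in the array of FIG. 1
REFRESH_INTERVAL_NS = 60_000   # one row refreshed every 60 microseconds
ROW_ACCESS_NS = 30             # assumed cost of opening the row
PRECHARGE_NS = 40              # assumed cost of closing the row


def refresh_budget(rows=ROWS, interval_ns=REFRESH_INTERVAL_NS):
    """Return (ms between refreshes of the same row, fraction of time refreshing)."""
    sweep_ms = rows * interval_ns / 1e6
    overhead = (ROW_ACCESS_NS + PRECHARGE_NS) / interval_ns
    return sweep_ms, overhead


sweep_ms, overhead = refresh_budget()
print(f"{sweep_ms:.2f} ms per sweep, {overhead:.2%} of time spent refreshing")
```

Under these assumptions each row is revisited roughly every 123 ms, and refresh occupies on the order of a tenth of one percent of the chip's time.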
There are two primary types of DRAMs known in the art, asynchronous DRAMs and synchronous DRAMs. Asynchronous DRAMs do not have a clock input. Rather, complex timing constraints among various signals and addresses must be satisfied in order for the DRAM to operate properly. The two main control pins for asynchronous DRAMs are “row address strobe” (RAS) and “column address strobe” (CAS). To open a row, RAS is asserted (typically, lowered). To close a row, RAS is deasserted. To access a column, CAS is asserted, and to access another column, CAS must be deasserted and then reasserted. Note that CAS can be asserted and deasserted multiple times while RAS is asserted.
In contrast to asynchronous DRAMs, synchronous DRAMs (SDRAMs) accept a clock input, and almost all timing delays are specified with respect to this clock. In addition, SDRAMs usually have two or four different logical arrays of memory (or banks) that can operate independently. Rather than use separate RAS and CAS signals for each bank, a sequence of commands is sent to the DRAM synchronously to perform page opening, column access, and page closing functions. Additional address bits are used for bank selection. One major benefit provided by SDRAMs is pipelining. While one bank is being accessed, another bank can be refreshed or precharged in the background.
Despite these differences, SDRAM organization is very similar to asynchronous DRAM organization. In fact, many memory controllers for asynchronous DRAMs support multiple banks and background refreshing and precharging operations.
DRAM chips can be organized to form main memory systems in a variety of ways. Typically the width and speed of the system bus are matched to the width and speed of the main memory system bus by providing the main memory system bus with the same bandwidth as the system bus. Usually system busses are both faster and wider than the data I/O interface provided by individual DRAM chips, so multiple DRAM chips are arranged in parallel to match the bandwidth of the system bus. If a particular computer system has a 16 byte wide data bus that operates at 66 MHz, then a main memory subsystem of the computer system that operates at 33 MHz and is constructed with 4-bit wide DRAM chips will typically have 64 DRAM chips arranged in each bank, thereby providing each bank with a bandwidth of nearly a gigabyte per second, which matches the bandwidth of the system data bus. If the bandwidths are not matched, other techniques may be employed, such as using a small FIFO to buffer memory accesses and blocking memory accesses when the FIFO is full.
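The chip count in the example above can be reproduced by equating the bit rate of the system bus with the combined bit rate of the chips in one bank. The Python sketch below is illustrative only; the function and parameter names are hypothetical.

```python
def chips_per_bank(bus_bytes, bus_mhz, mem_mhz, chip_bits):
    """Smallest number of chips whose combined bit rate matches the system bus."""
    bus_mbit_per_s = bus_bytes * 8 * bus_mhz
    chip_mbit_per_s = chip_bits * mem_mhz
    # Ceiling division, kept in integers to avoid rounding error.
    return -(-bus_mbit_per_s // chip_mbit_per_s)


chips = chips_per_bank(bus_bytes=16, bus_mhz=66, mem_mhz=33, chip_bits=4)
bank_mb_per_s = chips * 4 * 33 // 8
print(chips, bank_mb_per_s)  # 64 1056
```

Sixty-four 4-bit chips clocked at 33 MHz deliver 1,056 megabytes per second, the same as a 16 byte wide bus clocked at 66 MHz.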
It is also common for computers to use cache memories to increase performance. A cache memory holds a subset of the contents of main memory and is faster and smaller than main memory. An architecture common in the art provides a level one (L1) cache on the same integrated circuit as the microprocessor, and a level two (L2) cache on the system board of the computer. L1 cache sizes are generally in the range of 8 kilobytes to 128 kilobytes, and L2 cache sizes are generally in the range of 256 kilobytes to 4 megabytes. The smallest unit of memory that can be loaded into a cache memory is known in the art as a cache line.
In a computer system having 16 byte wide system and memory data busses, assume that the cache line size is 64 bytes. Therefore, it will generally take four bus clock ticks for each load to, or store from, a cache line. If the computer's data bus is clocked faster than the DRAMs, it is common to use a small pipelining FIFO to match the speeds, as discussed above. Another alternative is to use a wider memory bus and multiplex it at high speeds onto the computer's data bus, which is also discussed above.
For simplicity, assume that the computer system described above uses a small pipelining FIFO. If each bank is arranged as 32 4-bit wide DRAM chips to match the width of the system data bus, the minimum memory increment (a single bank) is 32 4-megabit chips, which is 16 megabytes. Typically, when addressing more memory, in order to keep physical memory contiguous, the high order physical address bits are used to select different banks.
Consider a 32-bit logical address in which each address value indexes a single byte. In the computer system described above, the memory bus is 16 bytes wide. Therefore, the least significant 4 address bits are ignored because these bits are implicitly represented by the arrangement of DRAM chips within a bank. Since the cache lines are 64 bytes long, the next 2 address bits act as column index bits that are cycled to obtain an entire cache line. The next 7 bits are assigned to the remaining column address bits, and the next 11 bits are assigned to the row address bits. Note that FIG. 1 (PRIOR ART) shows nine column address bits because the two address bits that are used to access a cache line are manipulated by a memory controller, which is not shown in FIG. 1 (PRIOR ART).
This is a total of 24 bits, which correctly matches the 16-megabyte memory bank size discussed above. The address bits between the 25th and 32nd bit, inclusive, are used to select a particular memory bank, or are unused. The first bank of 32 chips will be selected if addresses are within the first 16 megabytes of the physical address range, the second bank of 32 chips will be selected if addresses are within the next 16 megabytes, and so on. This is the simplest and most common way to select different groups of memory chips, and is widely used in the art.
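The address partitioning described above can be expressed as a few shifts and masks. The following Python sketch is illustrative only; the field names and the `decode` function are hypothetical, and the nine-bit column value combines the two cache-line index bits with the seven remaining column bits.

```python
from collections import namedtuple

Decoded = namedtuple("Decoded", "bank row column byte_offset")

BYTE_BITS = 4      # byte within the 16 byte wide memory bus
COLUMN_BITS = 9    # 2 cache-line index bits + 7 remaining column bits
ROW_BITS = 11
BANK_BITS = 8      # address bits 25 through 32, counting from 1


def decode(addr):
    """Split a 32-bit byte address into bank, row, column and byte offset."""
    byte_offset = addr & ((1 << BYTE_BITS) - 1)
    column = (addr >> BYTE_BITS) & ((1 << COLUMN_BITS) - 1)
    row = (addr >> (BYTE_BITS + COLUMN_BITS)) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (BYTE_BITS + COLUMN_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return Decoded(bank, row, column, byte_offset)


print(decode(0x01002345))
# Decoded(bank=1, row=1, column=52, byte_offset=5)
```

The three low-order fields total 24 bits, so each bank value spans exactly 2^24 bytes, or 16 megabytes.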
Consider page buffers of the DRAM chips that form a single bank. All the individual page buffers are accessed in parallel, thereby combining to form a larger “logical” page buffer. As shown in FIG. 1 (PRIOR ART), each DRAM chip 10 has a 2,048 bit, or 256 byte, page buffer 14. Since 32 chips are arranged in parallel, the logical page buffer is 8,192 bytes wide. If the low order address bits are used to index columns, two memory locations having addresses that differ only in the lower 13 bits of the logical memory address will be in the same row, and therefore will be available in a logical page buffer concurrently.
Each bank of 32 parallel DRAM chips has its own set of page buffers. Therefore, a logical page buffer exists for each memory bank provided in the computer system. If the high order address bits are used to select banks, as described previously, then there is an 8 kilobyte logical page buffer for the first 16 megabytes of physical memory, another 8 kilobyte logical page buffer for the next 16 megabytes of physical memory, and so on.
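The size of the logical page buffer, and the test for whether two addresses share it, can be sketched as follows. This Python fragment is illustrative only and its names are hypothetical; it assumes the high order bank selection described above, so that all bits above the lower 13 identify the bank and row together.

```python
CHIPS_PER_BANK = 32
PAGE_BYTES_PER_CHIP = 2048 // 8                      # 256 byte page buffer 14
LOGICAL_PAGE_BYTES = CHIPS_PER_BANK * PAGE_BYTES_PER_CHIP
PAGE_OFFSET_BITS = LOGICAL_PAGE_BYTES.bit_length() - 1


def same_logical_page(addr_a, addr_b):
    """True when two byte addresses differ only within the page offset bits."""
    return (addr_a >> PAGE_OFFSET_BITS) == (addr_b >> PAGE_OFFSET_BITS)


print(LOGICAL_PAGE_BYTES, PAGE_OFFSET_BITS)          # 8192 13
print(same_logical_page(0x00000040, 0x00001FC0))     # True
print(same_logical_page(0x00000040, 0x00002000))     # False
```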
If the system described above employed SDRAMs having bank select bits, the internal banks of the SDRAMs may be viewed as collections of relatively independent banks of DRAMs, with the high order address bits used as bank select bits. Accordingly, for the purpose of illustrating the present invention below, there is little difference between memory banks that are derived from collections of chips addressed independently, and memory banks that are derived from bank select inputs to specific SDRAM chips.
Consider a typical cache line read in the system described above. First, the appropriate bank is selected, and then a row is transferred into the logical page buffers. This takes approximately 30 ns. Next, four 16-byte chunks are read from the logical page buffer; this takes approximately 60 ns (30 ns for the first 16-byte chunk, and 10 ns for each of the next three 16-byte chunks), and provides a complete cache line. Finally, the logical page buffer is closed; this takes 40 ns. The total time is 130 ns. The time before the first word is read is 60 ns (page open plus first column access). Many systems are configured such that the first word available is the first word required by the CPU. The time required to retrieve the first word is known in the art as the “critical word latency”.
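The timing of this cache line read can be tabulated with a short calculation. The Python sketch below uses only the delays given above; the function name and dictionary keys are hypothetical.

```python
ROW_ACCESS_NS = 30     # open the page
FIRST_COLUMN_NS = 30   # first 16-byte chunk from the open page
NEXT_COLUMN_NS = 10    # each further 16-byte chunk
PRECHARGE_NS = 40      # close the page


def cache_line_read(line_bytes=64, bus_bytes=16):
    """Return the timing breakdown for one closed-page cache line read."""
    chunks = line_bytes // bus_bytes
    column_ns = FIRST_COLUMN_NS + (chunks - 1) * NEXT_COLUMN_NS
    return {
        "chunks": chunks,
        "column_ns": column_ns,
        "critical_word_ns": ROW_ACCESS_NS + FIRST_COLUMN_NS,
        "total_ns": ROW_ACCESS_NS + column_ns + PRECHARGE_NS,
    }


print(cache_line_read())
# {'chunks': 4, 'column_ns': 60, 'critical_word_ns': 60, 'total_ns': 130}
```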
It is common in the art for a memory controller to gamble that successive references to the same memory bank will access the same row (or page). Such a memory controller is known as a page mode memory controller. A page hit occurs when the memory controller processes a memory access request, and finds that the row that needs to be accessed is already in the logical page buffer. In a page mode memory controller, the page is not closed after an access. Instead, the page is only closed when an access to that bank requires a different page or a refresh cycle occurs.
If a subsequent memory access is indeed for the same page, then the critical word latency is shortened from 60 ns to just 30 ns, a significant savings. If a subsequent memory access is not for the same page, then a penalty is incurred. The old page stored in the logical page buffer must undergo a precharge cycle before a new page can be opened, so the critical word latency is 40 ns (precharge) plus 30 ns (row access) plus 30 ns (first word available), or 100 ns, quite a bit more than the previous value of 60 ns that is achieved when the logical page buffer is precharged after every access.
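The behavior of a page mode memory controller can be modeled in a few lines. The Python sketch below is illustrative only: the class and method names are hypothetical, the bank and row fields follow the address layout described earlier, and it assumes that a bank with no open page is already precharged, so that its first access costs 60 ns.

```python
ROW_ACCESS_NS = 30
FIRST_COLUMN_NS = 30
PRECHARGE_NS = 40


class PageModeController:
    """Tracks the open row of each bank and reports critical word latency."""

    def __init__(self):
        self.open_row = {}   # bank -> row held in that bank's logical page buffer

    def critical_word_latency(self, addr):
        bank = addr >> 24
        row = (addr >> 13) & 0x7FF
        current = self.open_row.get(bank)
        self.open_row[bank] = row            # the page is left open afterwards
        if current == row:
            return FIRST_COLUMN_NS           # page hit
        if current is None:
            return ROW_ACCESS_NS + FIRST_COLUMN_NS   # assumed: bank precharged
        return PRECHARGE_NS + ROW_ACCESS_NS + FIRST_COLUMN_NS   # page miss


ctrl = PageModeController()
print([ctrl.critical_word_latency(a) for a in (0x0000, 0x0040, 0x2000)])
# [60, 30, 100]
```

The second access hits the page opened by the first; the third addresses a different row of the same bank and pays the full 100 ns penalty.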
If p is the probability that the next access is on the same page, then the average critical word latency is 30 ns*p + 100 ns*(1−p) (or 100 ns − 70 ns*p). Note that the critical word latency decreases as p increases. The point at which the gamble pays off is when the average critical word latency is 60 ns, which, as described above, is the critical word latency achieved when the logical page buffer is closed after each memory access. Setting 100 ns − 70 ns*p equal to 60 ns gives p = 40/70, or approximately 0.571. Accordingly, the point at which it pays to keep the logical page buffer open after each access occurs when there is a greater than 0.571 probability that a sequential memory access will reference the same page.
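The break-even probability can be verified exactly. The Python sketch below uses rational arithmetic so that no rounding is involved; the names are hypothetical.

```python
from fractions import Fraction

HIT_NS = 30           # critical word latency on a page hit
MISS_NS = 100         # critical word latency on a page miss
CLOSED_PAGE_NS = 60   # critical word latency when the page is closed after every access


def average_latency(p):
    """Average critical word latency for page hit probability p."""
    return HIT_NS * p + MISS_NS * (1 - p)


# Solve HIT_NS*p + MISS_NS*(1 - p) = CLOSED_PAGE_NS for p.
break_even = Fraction(MISS_NS - CLOSED_PAGE_NS, MISS_NS - HIT_NS)

print(break_even, round(float(break_even), 3))   # 4/7 0.571
print(average_latency(break_even))               # 60
```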
Assume that, in a computer system having a page mode memory controller, requests are fed to the memory controller as fast as they can be consumed. Each first access to a page in a bank requires a precharge cycle to close the old page and a row access cycle to open the new page, which together require 70 ns. As described above, each cache line access from an open page requires 60 ns. Thus, an average cache line access requires 60 ns + 70 ns*(1−p) (or 130 ns − 70 ns*p). In contrast, a non-page mode memory controller requires 90 ns (30 ns for the row access plus 60 ns for the column accesses). As with the critical word latency, the point at which it pays to keep the logical page buffer open after each access occurs when there is a greater than 0.571 probability that a sequential memory access will reference the same page.
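The effect of the page hit probability on sustained throughput can be illustrated numerically. The Python sketch below converts the per-cache-line times given above into bandwidth for a 64 byte cache line; the names are hypothetical, the sample probability of 0.9 is chosen purely for illustration, and megabytes are decimal.

```python
LINE_BYTES = 64
OPEN_PAGE_LINE_NS = 60   # cache line read from an already open page
PAGE_SWITCH_NS = 70      # precharge plus row access on a page miss
CLOSED_PAGE_LINE_NS = 90 # per-line time given for a non-page mode controller


def page_mode_line_ns(p):
    """Average time per cache line for page hit probability p."""
    return OPEN_PAGE_LINE_NS + PAGE_SWITCH_NS * (1 - p)


def mb_per_s(line_ns):
    """Sustained bandwidth in decimal megabytes per second."""
    return LINE_BYTES / line_ns * 1000


for p in (0.0, 4 / 7, 0.9, 1.0):
    print(f"p={p:.3f}: {page_mode_line_ns(p):6.1f} ns, {mb_per_s(page_mode_line_ns(p)):7.1f} MB/s")
print(f"non-page mode: {mb_per_s(CLOSED_PAGE_LINE_NS):.1f} MB/s")
```

At p = 4/7 the two policies deliver the same throughput; above that probability the page mode controller is faster, and below it the non-page mode controller is faster.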
As processor clock rates and scalarity increase, it is becoming more important to reduce memory latency. A method to improve the success rate of a page hit in a page mode memory controller and reduce memory latency so as to improve the overall performance of a computer system has long been sought but has eluded those skilled in the art.
The present invention provides a method and an apparatus to improve the success rate of a page hit in a page mode memory controller and reduce memory latency so as to improve the overall performance of a computer system.
The present invention provides a method and an apparatus for addressing a main memory unit in a computer system which results in improved page hit rate and reduced memory latency by only keeping open some recently used pages and speculatively closing the remaining pages in the main memory unit.
The present invention further provides a method of accessing a main memory unit using a logical address generated by a processor in a computer system. The main memory unit is organized in banks, rows and columns and is addressed via bank, row and column bits. Each bank includes a plurality of rows, and the computer system includes a memory controller coupled between the processor and the main memory unit. The memory controller includes a plurality of bank controllers that are addressed by the bank bits, and the bank bits are maintained in a queue or stack. The method includes (1) selecting a predetermined number of bank controllers, wherein each selected bank controller opens a corresponding bank and each opened bank opens one of the plurality of rows; and (2) maintaining the bank bits of the predetermined number of bank controllers using the queue.
The present invention still further provides a computer system in which page hit rate is optimized and memory latency is reduced. The computer system includes: (1) a processor that generates a logical address to facilitate memory accesses, the logical address including bank index bits; (2) a main memory unit that is organized in banks, rows and columns; and (3) a memory controller coupled between the processor and the main memory unit. The main memory unit is addressed via bank, row and column bits, and each bank includes a plurality of rows. The memory controller generates from the logical address the bank, row, and column bits required by the main memory unit. The bank index bits are used to generate one or more bank bits, and the memory controller includes a plurality of bank controllers that are addressed by the bank bits. The bank bits are maintained in a queue, and the queue is operable to select a predetermined number of bank controllers.
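One way to picture a queue of bank bits that keeps only a predetermined number of banks open is sketched below in Python. This is a minimal illustration under stated assumptions and not the claimed implementation: it assumes the queue is ordered by recency of use and that the least recently used bank is the one speculatively closed, and the names `OpenBankQueue`, `max_open` and `access` are hypothetical.

```python
from collections import OrderedDict


class OpenBankQueue:
    """Keeps at most max_open banks open, ordered from oldest to newest use."""

    def __init__(self, max_open):
        self.max_open = max_open
        self.open_banks = OrderedDict()   # bank bits -> row open in that bank

    def access(self, bank, row):
        """Return (page_hit, closed_bank); closed_bank is None if nothing was closed."""
        closed_bank = None
        page_hit = self.open_banks.get(bank) == row
        if bank in self.open_banks:
            self.open_banks.move_to_end(bank)          # now the most recently used
        elif len(self.open_banks) >= self.max_open:
            # Assumed policy: close the least recently used bank.
            closed_bank, _ = self.open_banks.popitem(last=False)
        self.open_banks[bank] = row
        return page_hit, closed_bank


queue = OpenBankQueue(max_open=2)
for bank, row in [(0, 5), (1, 7), (0, 5), (2, 1), (1, 7)]:
    print(bank, row, queue.access(bank, row))
```

In this run the third access is a page hit on bank 0, the fourth closes bank 1 to make room for bank 2, and the fifth therefore misses on bank 1 and closes bank 0.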
The above and additional advantages of the present invention will become apparent to those skilled in the art from a reading of the following detailed description when taken in conjunction with the accompanying drawings.