The present invention relates to Dynamic Random Access Memory (DRAM) controllers. In particular, the present invention discloses an improved memory controller that provides for better memory data bus utilization for a random series of memory accesses.
Digital data processing products comprise one or more processors. These processors are electrically coupled to input/output devices such as disk storage, tape storage, keyboards, and displays, for example. The processors are also coupled to a memory. The memory is often configured as a hierarchy to provide a tradeoff between the cost of each level in the hierarchy, the size of each level, the access time to receive data from each level, and the bandwidth available to transfer data to or from each level.
For example, a level-1 cache (L1 cache) is usually placed physically on the same chip as a processor. Typically the processor can access data from L1 cache in one or two processor clock cycles. L1 cache is normally optimized for latency, meaning that the primary design goal is to get data from the L1 cache to the processor as quickly as possible. L1 caches are usually designed in Static Random Access Memory (SRAM) and occupy a relatively large amount of space per bit of memory on the semiconductor chip. As such, the cost per bit is high. L1 caches are typically designed to hold 32,000 bytes (32 KB) to 512 KB of data.
A level-2 cache (L2 cache) is normally designed to hold much more information than an L1 cache. The L2 cache usually contains 512 KB to 16,000,000 bytes (16 MB) of data storage capacity. The L2 cache is typically also implemented with SRAM, but in some cases is implemented with DRAM. The L2 cache typically takes several cycles to access.
A level-3 cache (L3 cache) is normally designed to hold much more information than an L2 cache. The L3 cache typically contains from 16 MB to 256 MB, and is commonly implemented with DRAM. The L3 cache is frequently on separate semiconductor chips from the processor, with signals coupling the processor with the L3 cache. These signals are routed on modules and printed wiring boards (PWBs).
A main memory is almost always implemented in DRAM technology, and is optimized for low cost per bit, as well as size. Today's large computers have main memory storage capacities of many gigabytes.
FIG. 1 shows a high-level block diagram of a computer. The computer comprises one or more processors. Modern computers may have a single processor, two processors, four processors, eight processors, 16 processors, or more. Processors 2A-2N are coupled to a memory 6 by a memory controller 4. Memory 6 can be any level of cache or main memory; in particular, memory 6 is advantageously implemented in DRAM for the present invention. A processor data bus 3 couples processors 2A-2N to memory controller 4. A memory data bus 5 couples memory controller 4 to memory 6. Optimizing the use of the bandwidth available on memory data bus 5 is important to maximize the throughput of the computer system. Memory data bus 5 should not be idle when there are outstanding requests for data from processors 2A-2N. A conventional memory controller comprises a number of command sequencers 8. Each command sequencer 8 manages one request at a time (a load request or a store request), and the command sequencer 8, when in control of memory data bus 5, is responsible for driving the Row Address Strobe (RAS), the Column Address Strobe (CAS), and any other associated control signals to memory 6 over memory data bus 5. Control typically passes from one command sequencer 8 to another command sequencer 8 in a round robin fashion. Memory controller 4 strives to make sure that each command sequencer 8 has a request to handle, to the degree possible in the current workload.
FIG. 2 is a more detailed view of memory 6, showing that memory 6 comprises four banks: bank 0, bank 1, bank 2, and bank 3. Four banks are shown for exemplary purposes, but more or fewer banks could be implemented in a particular design. Each bank has timing requirements that must be complied with. In some applications, e.g., numeric intensive applications, a particular type of DRAM, the Synchronous DRAM (SDRAM), can be operated in page mode, with many accesses to the same page, where a page is the same as a bank. Commercial workloads have a high percentage of random accesses, so page mode does not provide any performance benefit. In non-page mode, SDRAMs are designed for peak performance when consecutive accesses are performed to different banks. A read is performed by first opening a bank with a RAS (Row Address Strobe), waiting the requisite number of cycles, applying a CAS (Column Address Strobe), and waiting the requisite number of cycles, after which the data is transmitted from the bank into memory controller 4. Memory controller 4 must wait several cycles for the row in the bank to precharge (tRP) before reactivating that bank. A write is performed by opening a bank (RAS), issuing a write command along with a CAS, and transmitting data from memory controller 4 to the SDRAMs in the opened bank. That bank cannot be re-accessed until a write recovery time (tWR) has elapsed, as well as the row precharge time (tRP).
Switching the SDRAM data bus from performing a read to performing a write is expensive in terms of time, because the data bus must first be cleared of the read data from the last read command. When switching from writes to reads, the write data must be sent to the SDRAMs and the write recovery time must complete before a read command can be sent. The penalty incurred when switching from reads to writes, or writes to reads, is called the bus turnaround penalty.
FIGS. 3A-3E provide an example, using reads, showing how bandwidth on memory data bus 5 can be wasted if data from a particular bank is repeatedly accessed.
FIG. 3A lists the timing rules used in the example. The RAS-CAS delay is 3 cycles. The RAS-RAS delay, when the same bank is being addressed, is 11 cycles. The CAS-RAS delay, when addressing a different bank, is one cycle. The CAS-data delay is 3 cycles. A data transmittal, seen in FIGS. 3B-3E, requires four bus cycles.
FIG. 3B shows the sequential use of a single bank. Data A and data B are presumed to be in the same bank. That bank is opened with a RAS at cycle 1. The CAS is on cycle 4. Data is transmitted from that bank over memory data bus 5 to memory controller 4 during cycles 7, 8, 9, and 10. Because of the RAS-RAS 11-cycle requirement when the same bank is addressed, the bank cannot be opened again to read data B until cycle 12. The CAS for reading data B is sent on cycle 15, and data B is transmitted from that bank over memory data bus 5 to memory controller 4 on cycles 18, 19, 20, and 21. Note that, in this example, memory data bus 5 is not utilized on cycles 11, 12, 13, 14, 15, 16, and 17. As stated above, memory data bus 5 is used far more efficiently when consecutive accesses are to different banks.
FIG. 3C shows optimal memory data bus 5 usage when consecutive reads are to different banks. Requests A, B, C, and D are for data in separate banks. The RAS for data A is sent at cycle 1; the CAS for data A is sent at cycle 4. The RAS for data B can be sent at cycle 5, per the rules given in FIG. 3A. The CAS for data B is sent at cycle 8. Similarly, the RAS and CAS for data C are sent on cycles 9 and 12. The RAS and CAS for data D are sent on cycles 13 and 16. Memory data bus 5 is kept 100% busy once data transmittal has started.
FIG. 3D shows a case where requests for A, B, C, and D are consecutive requests from processors 2A-2N, but where data A and data C are in the same bank. Using the timing requirements of FIG. 3A, the bank containing data C cannot be reopened until the 12th cycle. This causes a 3-cycle gap in memory data bus 5 utilization, as shown in FIG. 3D.
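The cycle counts given for FIGS. 3B-3D can be reproduced with a short calculation. The following Python sketch is illustrative only and is not part of the disclosed apparatus. It assumes that requests are issued strictly in arrival order, that the first RAS is sent on cycle 1, and that each subsequent RAS is sent as early as the FIG. 3A rules allow. The names schedule_in_order and idle_cycles are hypothetical.

```python
# Timing rules taken from FIG. 3A (all values in bus cycles).
RAS_TO_CAS = 3         # RAS-CAS delay
RAS_TO_RAS_SAME = 11   # RAS-RAS delay when the same bank is addressed
CAS_TO_RAS_OTHER = 1   # CAS-RAS delay when a different bank is addressed
CAS_TO_DATA = 3        # CAS-data delay
BURST = 4              # bus cycles per data transmittal


def schedule_in_order(banks):
    """Return one (ras, cas, first_data, last_data) tuple per read request.

    `banks` lists the bank addressed by each request, in arrival order.
    Assumption: requests are never reordered.
    """
    last_ras = {}    # bank -> cycle of its most recent RAS
    prev_cas = None  # cycle of the most recent CAS on the command lines
    timeline = []
    for bank in banks:
        ras = 1 if prev_cas is None else prev_cas + CAS_TO_RAS_OTHER
        if bank in last_ras:
            # The same bank cannot be reopened before the RAS-RAS delay elapses.
            ras = max(ras, last_ras[bank] + RAS_TO_RAS_SAME)
        cas = ras + RAS_TO_CAS
        first = cas + CAS_TO_DATA
        timeline.append((ras, cas, first, first + BURST - 1))
        last_ras[bank] = ras
        prev_cas = cas
    return timeline


def idle_cycles(timeline):
    """Return the data bus cycles left unused between the first and last transfer."""
    busy = set()
    for _, _, first, last in timeline:
        busy.update(range(first, last + 1))
    return [c for c in range(min(busy), max(busy) + 1) if c not in busy]


if __name__ == "__main__":
    print(idle_cycles(schedule_in_order([0, 0])))        # FIG. 3B: A and B in one bank
    print(idle_cycles(schedule_in_order([0, 1, 2, 3])))  # FIG. 3C: four separate banks
    print(idle_cycles(schedule_in_order([0, 1, 0, 2])))  # FIG. 3D: A and C share a bank
```

Under these assumptions, the same-bank case of FIG. 3B leaves cycles 11 through 17 idle, the four-bank case of FIG. 3C leaves no idle cycles, and the case of FIG. 3D leaves cycles 15, 16, and 17 idle, which is the 3-cycle gap noted above.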
FIG. 3E shows how memory access requests can be reordered, and will be described in detail later in terms of the disclosed invention.
The memory controller has the very complicated task of managing the bank timings, maximizing the utilization of the memory data bus, and prioritizing reads over writes when possible. Furthermore, requests to access the same memory bank often exist in multiple command sequencers. Such requests can cause gaps in memory data bus usage in a round robin command sequencer activation scheme; alternatively, prioritization of the command sequencers can be accomplished only through extremely complicated logic and a large number of wires coupling the various command sequencers.
Therefore, there is a need for a memory controller design that improves the management of memory bank control, allowing for easier optimization of the memory data bus utilization.
The present invention is a method and apparatus that provides an improved memory controller that optimizes memory bus utilization for a series of memory accesses.
The present invention discloses a computer system with a memory, a memory controller, and a processor, wherein the memory controller is capable of reordering load and store requests in order to optimize the use of a memory data bus.
The present invention discloses a computer system with a memory controller having a dedicated bank sequencer for each memory bank. Each bank sequencer maintains queues of load and store requests destined for the bank for which the bank sequencer is dedicated. Each bank sequencer maintains timing information for its bank and does not forward requests to a central controller until its bank is available to service the request. The central controller receives requests, which are therefore already guaranteed to comply with bank timing requirements. The central controller can then dispatch requests to the memory based on predetermined priorities, without having to consider whether a particular request is valid from a bank timing requirement.
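The division of labor between the bank sequencers and the central controller can be sketched in software. The following Python model is a simplified illustration and not the disclosed hardware. The class and method names are hypothetical, the only bank timing modeled is the 11-cycle same-bank RAS-RAS delay of FIG. 3A, stores are given the same timing as loads (write recovery and bus turnaround are ignored), and the "predetermined priorities" are assumed here to be loads before stores, then oldest first.

```python
from collections import deque

# Timing values borrowed from FIG. 3A (bus cycles).
RAS_TO_CAS = 3
RAS_TO_RAS_SAME = 11
CAS_TO_RAS_OTHER = 1


class BankSequencer:
    """Holds the load and store queues and the timing state for one bank."""

    def __init__(self):
        self.loads = deque()
        self.stores = deque()
        self.next_ras_ok = 1  # earliest cycle at which this bank may be reopened

    def pending(self):
        return bool(self.loads or self.stores)

    def offer(self, cycle):
        """Forward a request only if the bank can service it at `cycle`."""
        if cycle < self.next_ras_ok:
            return None
        if self.loads:
            seq, name = self.loads[0]
            return (False, seq, name, self)
        if self.stores:
            seq, name = self.stores[0]
            return (True, seq, name, self)
        return None

    def issue(self, is_store, cycle):
        (self.stores if is_store else self.loads).popleft()
        self.next_ras_ok = cycle + RAS_TO_RAS_SAME


class CentralController:
    """Chooses among forwarded requests by priority alone, never by bank timing."""

    def __init__(self, num_banks):
        self.banks = [BankSequencer() for _ in range(num_banks)]
        self.seq = 0  # arrival order, used as the tie-breaker

    def submit(self, name, bank, is_load=True):
        queue = self.banks[bank].loads if is_load else self.banks[bank].stores
        queue.append((self.seq, name))
        self.seq += 1

    def run(self):
        """Return (name, ras_cycle) for each request in dispatch order."""
        issued, cycle = [], 1
        while any(b.pending() for b in self.banks):
            offers = [o for o in (b.offer(cycle) for b in self.banks) if o]
            if not offers:
                cycle += 1  # no bank is ready; wait one cycle
                continue
            # Assumed priority: loads before stores, then oldest first.
            is_store, _, name, bank = min(offers, key=lambda o: o[:2])
            bank.issue(is_store, cycle)
            issued.append((name, cycle))
            cycle += RAS_TO_CAS + CAS_TO_RAS_OTHER  # next RAS slot
        return issued


if __name__ == "__main__":
    mc = CentralController(num_banks=4)
    for name, bank in [("A", 0), ("B", 1), ("C", 0), ("D", 2)]:
        mc.submit(name, bank)
    print(mc.run())
```

In this model, with requests A and C addressed to bank 0, the bank 0 sequencer withholds C until its bank can be reopened, so the central controller dispatches A, B, D, and then C with RAS commands on cycles 1, 5, 9, and 13. The central controller itself never inspects bank timing; it only selects among the requests that have been forwarded to it.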
In an embodiment, the central controller comprises a single data bus sequencer. The single data bus sequencer advantageously comprises a read data bus sequencer and a write data bus sequencer. Since all requests to the memory controller are guaranteed to comply with bank timing requirements, the central controller can move any request forwarded to the memory controller to the data bus sequencer at the discretion of the memory controller for immediate execution on the memory data bus. The data bus sequencer does not have to be capable of delaying execution of the request; it need only be designed to comply with the RAS, CAS, and other control timing requirements of the memory.