1. Field of the Invention
The present invention relates to a high performance memory architecture for digital computer and video systems, and more particularly, to an architecture using two or more independent and cooperative channels that may be used to provide enhanced functionality, flexibility and performance.
2. Description of Related Art
Semiconductor computer memory integrated circuits (ICs) have traditionally utilized an internal architecture defined in an array having rows and columns, with the row-column address intersections defining individual data storage locations. These intersections are typically addressed through an internal dedicated address bus, and the data to be stored or read from the locations is typically transferred on a separate dedicated internal input/output (I/O) bus. In a similar manner, the data and address information is communicated between the memory IC and external devices by use of separate dedicated paths. Semiconductor memory configurations utilizing this basic architecture include Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Erasable Programmable Read Only Memory (EPROM), Electrically Erasable PROM (EEPROM), and so-called "flash" memory.
The architectural constraint of dedicated address and I/O buses has arisen due to simplicity and practical considerations relating to microprocessor operating frequency. Traditionally, microprocessors operated at a cycle time that was greater than or equal to the memory device latency (also referred to as access time), and the memory device I/O bus would become inefficiently delayed if address information were to be multiplexed onto the I/O bus. The advent of faster speed, pipelined microprocessors has challenged this basic architectural premise, and has driven a demand for ever increasing performance from the memory architecture with minimal associated cost increase.
One of the most important measures of performance for such memory devices is the total usable data bandwidth of the I/O bus, typically measured in megabytes per second (MB/sec). For example, a memory device having a single eight-bit (1 byte) wide bus that can operate at a maximum frequency of 33 MHz can deliver data at a rate of 33 MB/sec. Data bandwidth is often quoted as a theoretical maximum, or "peak bandwidth," which assumes the perfect case wherein data is transferred at a rate of a single byte per cycle without interruption. More realistically, however, the maximum usable bandwidth for a particular memory application is a somewhat smaller number and can vary widely for different applications utilizing the same memory device. For most applications with most types of memory devices, it is difficult in practice to utilize more than about 70% to 80% of the peak bandwidth of a memory device.
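By way of illustration only, the peak and usable bandwidth figures discussed above can be reproduced with a short calculation. The function names below are hypothetical, and the 70% to 80% utilization range is simply the rule of thumb stated above rather than a measured property of any particular device.

```python
def peak_bandwidth_mb_s(bus_width_bits, clock_mhz):
    """Peak bandwidth in MB/sec, assuming one bus-wide transfer per clock cycle."""
    return bus_width_bits / 8 * clock_mhz


def usable_bandwidth_range_mb_s(peak_mb_s, low=0.70, high=0.80):
    """Rough usable range, assuming 70% to 80% utilization of the peak (rule of thumb)."""
    return peak_mb_s * low, peak_mb_s * high


# The eight-bit, 33 MHz example from the text.
peak = peak_bandwidth_mb_s(bus_width_bits=8, clock_mhz=33)
low, high = usable_bandwidth_range_mb_s(peak)
print(f"peak: {peak:.0f} MB/sec, usable: roughly {low:.0f} to {high:.0f} MB/sec")
```

The same calculation applies to any of the bus widths and frequencies quoted in this section; only the utilization factor depends on the application.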
Usable data bandwidth is affected by two main types of timing delays: (1) access time; and (2) cycle time. Access time is defined as the delay between the arrival of new address information at the address bus and the availability of the accessed data on the I/O bus. Cycle time is determined by the maximum rate that new data can be provided to the memory device outputs. For standard DRAMs, cycle time for row accesses is significantly greater than associated access time due to the necessity of additional memory cell pre-charge operations before or after a memory access transaction; by comparison, asynchronous SRAMs have roughly equivalent cycle time and access time. Newer pipelined configurations of DRAMs and synchronous SRAMs have a cycle time that is significantly less than their access time.
Microprocessor frequency and performance have increased dramatically over recent years, and there has been an associated effort to increase the usable data bandwidth of memory devices in order to keep pace with the higher speed microprocessors. There has also been a trend toward reduction of the number of individual memory chips per system due in part to the faster increase in available memory chip density as compared to the average increase in system main memory size. Emerging multimedia and portable applications will generally require a very small number of DRAMs per system, e.g., fewer than four. Previous efforts to increase the usable data bandwidth have concentrated primarily on reducing access time and/or cycle time.
The least complex method of improving data bandwidth is to utilize SRAM chips having much faster access and cycle times than the much cheaper and higher density DRAMs; access time for typical SRAMs is in the range of approximately 15 nanoseconds (ns) versus 60 ns for DRAMs. SRAMs have a relatively high current output differential memory cell that is conducive to faster sensing techniques which enable the fast access time. While SRAMs have much higher data bandwidth, they are not without drawbacks. In particular, SRAMs require more area within the chip to store the same amount of information as DRAMs, and are thus more expensive to manufacture for a given memory chip density. In view of these drawbacks, SRAMs are usually confined to certain high performance applications requiring particularly fast access in which the additional cost can be justified.
One such application of very high speed SRAMs is for use within a secondary cache that is located off-chip from the microprocessor. Present microprocessors generally include an on-chip memory, known as a primary cache. The primary cache stores information that is most frequently used by the microprocessor, such as instructions and data. To increase microprocessor speed, it is desirable to increase the percentage of memory accesses in which desired data is stored in the primary cache, known as the "hit rate." Primary cache hit rates may typically range from 80% to 90%. If the desired data is not available in the primary cache, known as a "miss," the microprocessor then must access the data from the off-chip memory. A data miss can represent a significant time penalty for current microprocessors due to their extremely fast cycle time and their execution of multiple instructions in each cycle.
By adding some number of fast SRAMs off-chip, known as a secondary cache, the hit rate can be further increased to the 95% range. Low-end secondary cache applications generally combine multiple commodity eight-bit wide SRAM chips that operate at 12 to 15 ns access/cycle time in order to provide a thirty two-bit or sixty four-bit bus. The usable data bandwidth for such a secondary cache can be as much as 256 to 512 MB/sec. Despite the hit rate improvement due to secondary cache, however, even a 5% miss rate may be intolerable for certain systems. Another drawback of secondary cache memory is cost, because off-chip SRAM chips are three to four times more expensive than DRAM chips for the same bit density. For a microprocessor with a reasonably sized primary cache, e.g., sixteen to thirty two kilobytes (KB), the overall system improvement with a secondary cache may only be 10% to 15%, which may not be sufficient to justify the substantial increase in cost. Further drawbacks of secondary cache include printed circuit board area to contain the secondary cache, power dissipation and design complexity.
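The sensitivity of system performance to hit rate can be sketched with a simple weighted average. This is an illustrative model only: the function name is hypothetical, and using the 15 ns SRAM and 60 ns DRAM access times quoted above as the cost of a hit and a miss is a simplifying assumption that ignores the primary cache, bus overhead and pipelining.

```python
def average_access_ns(hit_rate, hit_ns, miss_ns):
    """Expected access time when a fraction hit_rate of accesses is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


# Assumed costs: 15 ns for a cache hit (SRAM), 60 ns for a miss (DRAM).
for hit_rate in (0.80, 0.90, 0.95):
    avg = average_access_ns(hit_rate, hit_ns=15.0, miss_ns=60.0)
    print(f"hit rate {hit_rate:.0%}: about {avg:.2f} ns per access")
```

Under these assumptions, each additional point of hit rate removes less time than the one before it in relative terms, which is consistent with the modest overall improvement attributed above to a secondary cache.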
The secondary cache concept can be extended further for high end systems by defining specific SRAM architectures utilizing very wide data buses, pipelining techniques, burst counters and very high clock frequencies. Some of the new secondary cache SRAM configurations under development have sixteen-bit wide and even thirty two-bit wide buses, and operate at frequencies exceeding 100 MHz. These configurations provide very good performance, including usable data bandwidth on the order of 200 to 400 MB/sec per chip.
Notwithstanding these benefits, production costs of such new secondary cache SRAM configurations are significantly higher than commodity DRAMs. Further, die size, heat dissipation and pin counts are larger, and the device may require relatively expensive packaging, e.g., ceramic rather than plastic. Thus, large width secondary cache SRAMs are appropriate for certain applications, such as high end work stations and personal computers, but are too expensive and power consuming for consumer applications, such as lap-top computers, hand-held personal data assistants (PDAs), and the like, many of which still demand the highest possible level of performance.
In contrast, DRAMs are by far the most widely used type of semiconductor memory device due to their low cost and reasonable data bandwidth. At present, however, state of the art DRAMs have significantly slower random row access times than SRAMs, having approximately 60 ns access and 100 ns cycle times to randomly access any row within the device. This performance corresponds to an operating frequency of only 10 MHz, which for an eight-bit wide DRAM equals a data bandwidth of only 10 MB/sec. Current standard microprocessors are already approaching internal clock frequencies of 100 MHz. For a thirty-two bit memory data bus operating at that frequency, the performance requirement would be 400 MB/sec, which is far beyond the bandwidth capacity of the standard DRAM architecture.
Until the present time, most DRAMs have been used in memory systems containing many individual DRAM chips (such as large computers, work stations, etc.). In such systems, slightly higher bandwidth could be achieved at the system level by accessing many DRAMs in parallel at the same time to provide a wider effective memory bus, and also by use of so-called interleaving techniques. Interleaving comprises the use of two groups of DRAMs in which a first group of DRAMs is accessed while an access operation of a second group is already underway. The returning data from both groups is then provided on the same data bus, but out of phase from each other. The use of separate address and data buses makes interleaving possible. Performance improvement techniques such as interleaving can reduce the memory system bus cycle time to as little as half that of an individual DRAM; however, these techniques are unusable for the majority of systems that contain only a very small number of DRAMs. Finally, the DRAM performance shortfall can be mitigated somewhat by increasing the data bus width of the individual DRAM chips, such as from eight to sixteen bits. Still, this is a short term improvement that has cost drawbacks in terms of increased chip area, power requirement, and package cost, without significantly closing the performance gap between DRAM bandwidth and system requirements.
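The effect of interleaving on bus cycle time can be sketched with a simple timing model. The function name and the scheduling rule (accesses issued round-robin, evenly out of phase) are assumptions made for illustration; the 60 ns access time and 100 ns cycle time are the typical DRAM figures quoted above.

```python
def interleaved_bus_times(n_accesses, n_groups, access_ns=60, cycle_ns=100):
    """Times (ns) at which data reaches the shared bus under round-robin interleaving.

    Assumed model: a group cannot begin a new access until cycle_ns after its
    previous one began, and accesses are issued cycle_ns / n_groups apart.
    """
    issue_gap = cycle_ns / n_groups
    group_free_at = [0.0] * n_groups
    data_times = []
    earliest_issue = 0.0
    for i in range(n_accesses):
        group = i % n_groups
        start = max(earliest_issue, group_free_at[group])
        group_free_at[group] = start + cycle_ns   # group busy for a full cycle
        data_times.append(start + access_ns)      # data appears after access time
        earliest_issue = start + issue_gap
    return data_times


print(interleaved_bus_times(4, n_groups=1))  # one group: data every 100 ns
print(interleaved_bus_times(4, n_groups=2))  # two groups: data every 50 ns
```

In this model each individual access still takes 60 ns; only the spacing between successive data items on the shared bus is reduced, and only when at least two groups of DRAMs are available.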
All DRAM architectures have an inherently faster column access time than their associated row access time. This aspect arises due to the very nature of the DRAM memory cells and sensing structures. Each DRAM memory cell is a very small capacitor which provides a very small voltage signal on the bit lines that must be read very slowly and carefully in order to guarantee that the correct data is sensed during a row read operation. The small voltage levels are actually destroyed during the read operation, providing a so-called destructive read. The sensing operation must therefore restore full voltage levels back on the bit lines in order to refresh the memory cell upon completion of the row access. Due to this requirement, the bit line sense amplifiers must be constructed of differential latch circuits which are topologically similar to a six transistor SRAM memory cell.
As a natural consequence of this configuration, completely random row access times into any memory cell location of the DRAM are very long (on the order of 60 ns for current technology). In a column access, by contrast, data is read directly from the much smaller number of enabled high current bitline sense amplifiers, providing a row cache capability. Therefore, the access time for column access operations can be more than twice as fast as for a row access, and is analogous to the faster SRAM access time. Unfortunately, only a very small percentage of the memory device can be accessed at the faster latency at any given time. Memory system designers have attempted to structure the devices to take advantage of the faster column access, albeit with limited success. The current trend toward smaller numbers of chips within a memory system greatly exacerbates this situation, as the number of independent row locations that can be enabled to provide the faster column accesses is reduced as the number of chips per system decreases. As a result, the "hit rate" on these so-called row caches becomes so low that the increased complexity associated with this mode is not justified in terms of cost and performance of the final system.
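The dependence of the row cache hit rate on the number of independently enabled rows can be illustrated with a small simulation. The function name, the mapping of rows to banks by a modulo operation, and the access pattern are all hypothetical choices made for this sketch; an actual system would map addresses according to its own memory organization.

```python
def row_cache_hit_rate(row_accesses, n_open_rows):
    """Fraction of accesses that find their row already enabled.

    Assumed model: the system holds n_open_rows independent row caches, and
    row r always maps to cache slot r % n_open_rows (hypothetical mapping).
    """
    if not row_accesses:
        return 0.0
    open_row = {}
    hits = 0
    for row in row_accesses:
        slot = row % n_open_rows
        if open_row.get(slot) == row:
            hits += 1                 # fast column access
        else:
            open_row[slot] = row      # slow row access replaces the open row
    return hits / len(row_accesses)


# Hypothetical pattern that alternates between pairs of rows.
pattern = [0, 1, 0, 1, 2, 3, 2, 3]
for n in (1, 2):
    print(f"{n} open row(s): hit rate {row_cache_hit_rate(pattern, n):.2f}")
```

With a single open row, every change of row in this pattern forces a full row access, whereas two open rows allow half of the accesses to proceed at column access speed.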
A different approach to improving data bandwidth is to concentrate instead on reducing the cycle time of the memory device, rather than the access time. One way to provide a cycle time enhancement is to utilize a pipelined structure, as in the synchronous SRAMs mentioned previously. Other examples of pipelined structures include Extended Data Out (EDO) DRAMs, synchronous DRAMs, and RAMBUS™ DRAMs (or RDRAMs).
The most evolutionary modification to the standard DRAM architecture is the advent of the EDO DRAMs. This is a relatively minor modification to the column mode circuitry of the DRAM which allows a faster usable column mode cycle time within a system environment. Row and column access times are unchanged from the standard DRAM architecture. A further enhancement of this technique is a Burst EDO, in which a burst counter is added to the DRAM to allow sequential data bytes stored within the device to be accessed without having to supply new addresses for each data byte. Instead, the burst counter supplies the addresses. These EDO DRAM architectures are expected to achieve data bandwidth rates in the 132 MB/sec range for sixteen-bit devices in the near future.
A more sophisticated evolutionary approach to achieving higher pipelined bandwidth is the synchronous DRAM, in which a master clock signal and other architectural enhancements provide faster usable burst cycle times than either EDO or Burst EDO DRAMs. In synchronous DRAMs, data available on column sense amplifiers following a row access is used to create a burst mode, which can reduce cycle time to the 10 ns range. Subsequent data can be streamed out from the column sense amplifiers at 10 ns intervals due to the pipeline configuration, which results in a 100 MHz data rate. Synchronous DRAMs have a random access time of about 60 ns, and a column access time of about 30 ns. At present, synchronous DRAM configurations are 8-bits wide and 16-bits wide, the latter providing a peak data bandwidth of 200 MB/sec.
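The gap between peak and usable burst bandwidth can be sketched as follows. The function name and the timing model (one full random access followed by one data item per burst cycle) are assumptions for illustration; the default values are the sixteen-bit width, 60 ns random access time and 10 ns burst cycle time quoted above.

```python
def burst_bandwidth_mb_s(burst_length, bus_width_bits=16,
                         first_access_ns=60, burst_cycle_ns=10):
    """Effective bandwidth (MB/sec) of one burst that begins with a random row access.

    Assumed model: the first data item arrives after first_access_ns, and each
    subsequent item follows burst_cycle_ns later.
    """
    if burst_length < 1:
        raise ValueError("burst_length must be at least 1")
    total_bytes = burst_length * bus_width_bits // 8
    total_ns = first_access_ns + (burst_length - 1) * burst_cycle_ns
    return total_bytes * 1000 / total_ns   # bytes per ns, scaled to MB/sec


for n in (1, 5, 64):
    print(f"burst of {n}: about {burst_bandwidth_mb_s(n):.1f} MB/sec")
```

Under this model the 200 MB/sec peak is approached only for long bursts, because the initial row access latency is amortized over the number of data items that follow it.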
A further architectural feature of synchronous DRAMs is the subdivision of the memory array into two independently addressable banks to improve channel utilization in certain applications. While one bank is in the middle of a burst transaction, row addresses can be provided to the other bank to minimize a row access latency. This can be useful in applications such as graphics, in which the next row addresses can be provided during burst transactions from the other bank. This allows very high utilization of peak data bandwidth for some graphics applications. However, the two banks provide only two independent row cache locations, which are not enough to allow significant row cache hit rates for systems with a small number of DRAM chips.
The RAMBUS™ DRAMs increase the pipelined burst bandwidth by use of an extremely fast nine-bit wide channel that achieves peak bandwidth of up to 528 MB/sec, as described in U.S. Pat. No. 5,319,755. System cost is reduced by the multiplexing of all address, data and control information on the single nine-bit bus, which becomes feasible due to the extremely fast burst cycle time. Usable bandwidth can be significantly lower than this peak, due to very long row and column latencies (e.g., up to 128 ns for current sixteen Mbit RAMBUS™ versions) as compared with other DRAM architectures, and the inability to hide these latency penalties due to the single channel for address, data and control information. The RAMBUS™ DRAMs provide two memory banks per chip, which can theoretically provide a usable system row cache for systems having a large number of chips on the single channel; however, this row cache concept would not be effective for relatively small systems that are expected in the future, due to the very small number of total row cache locations available to the system.
Yet another approach to increasing usable data bandwidth for virtually any application is to increase the number of ports that are available to access the individual storage locations within the memory device. For video graphics applications, specialized architectures called VRAMs have been implemented. A VRAM comprises a dual-port DRAM architecture optimized for graphics or video frame buffers. In bit-mapped graphics, pixels on the screen are directly mapped to locations within a DRAM frame buffer array. The number of DRAM bits per pixel can be as much as 32 bits/pixel or more for certain high end applications, such as three-dimensional graphics.
The VRAM primary port is similar to that of a common DRAM, with row and column access modes, but no burst mode. The secondary port is a serial port that provides a serial output from the VRAM frame buffer to update the red, green, and blue outputs to the video screen, typically through a RAMDAC, e.g., a color look-up table. Improved performance results from this secondary serial port, as all data is taken from the secondary port for updating the video without impacting throughput of the primary port. Thus, the peak usable data bandwidth represents the sum of the bandwidth associated with the primary DRAM port, plus the bandwidth associated with the secondary port.
Although improved performance is attained, VRAMs have a severe cost penalty relative to commodity DRAMs. The chip die size penalty for the serial port registers is substantial, and in some cases can be as much as 50%. Other factors contributing to the higher cost are lower sales volume, higher package cost due to more pins, and added features geared for graphics applications that are under-utilized in practice. Further, the two VRAM ports operate completely independently and cannot cooperate to improve the utilization of each other. Typically, VRAMs have a 60 ns random access time for the primary port, and a 25 ns serial-mode access time for the serial port. As the frequency of competing memory architectures continues to increase (as in the present invention), the advantages of the VRAM architecture will diminish since the screen refresh rate will eventually represent only a small percentage of total available bandwidth. Thus, it would be inefficient to dedicate a port solely to providing screen refresh.
Finally, a dual port SRAM architecture has been implemented for use with certain multi-processing applications. This architecture has two functionally identical ports that can generally operate simultaneously to access any memory location within the array, including the same memory cell (although simultaneous writes to a common location are not permitted to avoid data ambiguity). Resultant dual port SRAM data bandwidth is equivalent to both ports operating simultaneously at maximum frequency. This architecture typically has access times comparable to or slightly slower than a normal SRAM, from either port.
Unfortunately, this performance advantage is obtained at a significant cost. The memory cells themselves have a dual port structure, with essentially a complete replication of the entire address and data path from the memory cells to the external chip connections for each port. Other than some arbitration logic, the circuitry is not significantly different from standard SRAM circuits. The dual port SRAM devices are expensive because the memory cells and replicated periphery require two to four times as much chip area as normal SRAM circuitry of the same density, and the packaging cost is higher than normal because of the larger number of external connections for the dual address and data buses. In fact, dual port SRAMs can be four to eight times as costly as commodity DRAMs on a cost-per-bit basis. As a result of this high cost, use of dual port SRAMs is limited to a narrow range of applications which do not require a very large number of memory locations (which further exacerbates the cost penalty due to lower manufacturing volumes). Since the dual port SRAMs have separate address buses for each channel, there is no significant need or capability for the channels to cooperate with each other to enhance the usable data bandwidth.
In summary, a critical need exists for a multiple port, high data bandwidth memory architecture that can be implemented at low cost, with little additional silicon area, very low power dissipation, and with the minimum possible number of pins on an IC package. The architecture should permit flexibility in its use, and should minimize latency, cycle time, and utilization penalties by permitting the channels to operate completely independently of each other when appropriate, or to intelligently cooperate with each other to increase performance. Preferably, such architecture should be readily implemented using high density commodity integrated circuit chips, such as DRAMs, EEPROMs or flash memory, and should even be capable of further improving the usable data bandwidth of commodity SRAMs.