1. Field of the Invention
The present invention relates to a method and structure for implementing a memory system. More specifically, the invention relates to a second level cache memory.
2. Description of the Prior Art
High-speed computer systems frequently use fast, small-capacity cache (buffer) memory to transmit signals between a fast processor and a slow (and low cost), large-capacity main memory. Cache memory is typically used to temporarily store data which has a high probability of being selected next by the processor. By storing this high probability data in a fast cache memory, the average speed of data access for the computer system is increased. Thus, cache memory is a cost effective way to boost system performance (as compared to using all high speed, expensive memories). In more advanced computer systems, there are multiple levels (usually two levels) of cache memory. The first level cache memory, typically having a storage of 4 Kbytes to 32 Kbytes, is ultra-fast and is usually integrated on the same chip with the processor. The first level cache is faster because it is integrated with the processor and therefore avoids any delay associated with transmitting signals to and receiving signals from an external chip. The second level cache is usually located on a different chip than the processor, and has a larger capacity, usually from 64 Kbytes to 1024 Kbytes.
FIG. 1 is a block diagram of a prior art computer system 100 using an SRAM second level cache configuration. The CPU or microprocessor 101 incorporates on-chip SRAM first level cache 102 to support the very fast internal CPU operations (typically from 33 MHz to 150 MHz).
First level cache 102 typically has a capacity of 4 Kbytes to 32 Kbytes and performs very high speed data and instruction accesses (typically within 5 to 15 ns). For a first level cache miss or other non-cacheable memory access, the memory read and write operations must go off-chip through the much slower external CPU bus 104 (typically from 25 MHz to 60 MHz) to the SRAM second level (L2) cache 106 (typically with 128 Kbytes to 1024 Kbytes capacity), incurring the additional latency (access time) penalty of the round-trip off-chip delay.
The need for CPU 101 to manage the delay penalty of off-chip operation dictates that in almost all modern microprocessors, the fastest access cycle (read or write) through the CPU bus 104 is 2-1-1-1. That is, the first external access will consume at least 2 clock cycles, and each subsequent external access will consume a single clock cycle. At higher CPU bus frequencies, the fastest first external access may take 3 or more clock cycles. A burst cycle having 4 accesses is mentioned here for purposes of illustration only. Some processors allow shorter (e.g., 2) or longer (e.g., 8 or more) burst cycles. Pipelined operation, where the parameters of the first external access of the second burst cycle are latched into CPU bus devices while the first burst cycle is still in progress, may hide the longer access latency for the first external access of the second burst cycle. Thus, the first and second access cycles may be 2-1-1-1 and 1-1-1-1, respectively.
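The clock counts implied by these burst patterns can be tallied with a short sketch. The function names below are illustrative only and do not appear in the specification; the patterns are the 2-1-1-1 and 1-1-1-1 cycles described above.

```python
def burst_clocks(pattern):
    """Total bus clocks consumed by one burst cycle, e.g. "2-1-1-1"."""
    return sum(int(n) for n in pattern.split("-"))


def total_clocks(patterns):
    """Total bus clocks for a sequence of back-to-back burst cycles."""
    return sum(burst_clocks(p) for p in patterns)


# Two bursts without pipelining: each pays the 2-clock first access.
non_pipelined = total_clocks(["2-1-1-1", "2-1-1-1"])

# Two bursts with pipelining: the second first-access latency is hidden.
pipelined = total_clocks(["2-1-1-1", "1-1-1-1"])

print(non_pipelined, pipelined)  # prints "10 9"
```

Under these assumptions, pipelining saves one bus clock on every burst after the first.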
The cache tag memory 108 is usually relatively small (from 8 Kbytes to 32 Kbytes) and fast (typically from 10 to 15 ns) and is implemented using SRAM cells. Cache tag memory 108 stores the addresses of the cache lines of second level cache 106 and compares these addresses with an access address on CPU bus 104 to determine if a cache hit has occurred. This small cache tag memory 108 can be integrated with the system logic controller chip 110 for better speed and lower cost. An integrated cache tag memory operates in the same manner as an external cache tag memory. Intel's 82430 PCI set for the Pentium processor is one example of a logic controller chip 110 which utilizes an SRAM integrated cache tag memory.
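The tag comparison described above can be illustrated with a minimal sketch. It assumes a direct-mapped organization, a 32-byte line and a 256 Kbyte second level cache; none of these parameters is stated for cache tag memory 108, and the class and method names are hypothetical.

```python
LINE_BYTES = 32              # assumed cache line size
CACHE_BYTES = 256 * 1024     # assumed second level cache capacity
NUM_LINES = CACHE_BYTES // LINE_BYTES


class TagMemory:
    """Direct-mapped tag store: one stored tag per cache line."""

    def __init__(self):
        self.tags = [None] * NUM_LINES  # None marks an empty line

    @staticmethod
    def split(address):
        """Split a byte address into (line index, tag)."""
        line_number = address // LINE_BYTES
        return line_number % NUM_LINES, line_number // NUM_LINES

    def is_hit(self, address):
        """Compare the stored tag with the tag of the access address."""
        index, tag = self.split(address)
        return self.tags[index] == tag

    def fill(self, address):
        """Record that the line containing `address` is now cached."""
        index, tag = self.split(address)
        self.tags[index] = tag


tag_memory = TagMemory()
tag_memory.fill(0x1000)
print(tag_memory.is_hit(0x1010))                # same line: True
print(tag_memory.is_hit(0x1000 + CACHE_BYTES))  # same index, other tag: False
```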
One reason for the slower operating frequency of CPU bus 104 is the significant loading caused by the devices attached to CPU bus 104. Second level (L2) SRAM cache memory 106 provides loading on the data and address buses (through latch 112) of CPU bus 104. Cache tag memory 108 provides loading on the address bus, system logic controller chip 110 provides loading on the control, data and address buses, and main memory DRAM 114 provides loading on the data bus (through latch 116).
In prior art computer system 100, the system logic chip 110 provides an interface to a system (local) bus 118 having a typical operating frequency of 25 MHz to 33 MHz. System bus 118 may be attached to a variety of relatively fast devices 120 (such as graphics, video, communication, or fast disk drive subsystems). System bus 118 can also be connected to a bridge or buffer device 122 for connecting to a general purpose (slower) extension bus 124 (at 4 MHz to 16 MHz operating frequency) that may have many peripheral devices (not shown) attached to it.
Traditional high speed cache systems, whether first level or second level, are implemented using static random access memories (SRAMs) because SRAMs are fast (with access times ranging from 7 to 25 nanoseconds (ns) and cycle times equal to access times). SRAMs are suitable for storing and retrieving data from high-speed microprocessors having bus speeds of 25 to 100 megahertz. Traditional dynamic random access memories (DRAMs) are less expensive than SRAMs on a per bit basis because DRAM has a much smaller cell size. For example, a DRAM cell is typically one quarter of the size of an SRAM cell using comparable lithography rules. DRAMs are generally not considered to be suitable for high speed operation because DRAM accesses inherently require a two-step process having access times ranging from 50 to 120 ns and cycle times ranging from 90 to 200 ns.
Access speed is a relative measurement. That is, while DRAMs are slower than SRAMs, they are much faster than other earlier-era memory devices such as ferrite core and charge-coupled devices (CCD). As a result, DRAM could theoretically be used as a "cache" memory in systems which use these slower memory devices as a "main memory." The operation modes and access methods, however, are different from the operation modes and access methods disclosed herein.
In most computer systems, the second level cache operates in a fixed and rigid mode. That is, any read or write access to the second level cache is of a few constant sizes (line sizes of the first and second level caches) and is usually in a burst sequence of 4 or 8 words (i.e., consecutive reads or writes of 4 or 8 words) or in a single access (i.e., one word). These types of accesses allow standard SRAMs to be modified to allow these SRAMs to meet the timing requirements of very high speed processor buses. One such example is the burst or synchronous SRAM, which incorporates an internal counter and a memory clock to increment an initial access address. External addresses are not required after the first access, thereby allowing the SRAM to operate faster after the first access is performed. The synchronous SRAM may also have special logic to provide preset address sequences, such as Intel's interleaved address sequence. Such performance enhancement, however, does not reduce the cost of using SRAM cells to store memory bits.
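The internal burst counter can be sketched as follows. The linear sequence wraps within the aligned burst block; the interleaved sequence is modeled here as the start address XORed with the counter value, which is how the Intel burst order is commonly described and should be read as an assumption about the preset sequence referred to above. The function names are illustrative.

```python
def linear_burst(start, length=4):
    """Addresses generated by incrementing, wrapping inside the burst block.

    `length` is assumed to be a power of two.
    """
    base = start & ~(length - 1)  # aligned start of the burst block
    return [base | ((start + i) & (length - 1)) for i in range(length)]


def interleaved_burst(start, length=4):
    """Addresses generated by XORing the start address with the counter."""
    return [start ^ i for i in range(length)]


print(linear_burst(1))       # [1, 2, 3, 0]
print(interleaved_burst(1))  # [1, 0, 3, 2]
```

In both cases only the first address comes from the bus; the remaining addresses are produced inside the memory device.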
Synchronous DRAMs (SDRAM) have adopted similar burst-mode operation. Video RAMs (VRAM) have adopted the serial port operation of dual-port DRAMs. These new DRAMs are still not suitable for second level cache operation, however, because their initial access time and random access cycle time remain much slower than necessary.
It would therefore be desirable to have a structure and method which enables DRAM memory to be used as a second level cache memory.
Prior art computer systems have also included multiple levels of SRAM cache memory integrated on the same chip as the CPU. For example, DEC's Alpha 21164 processor integrates 16 Kbytes of first level SRAM cache memory and 96 Kbytes of second level SRAM memory on the same chip. In such cases, a third level SRAM cache is typically used between the processor and a DRAM main memory. In such a computer system, it would be desirable to use a DRAM memory to replace the third level SRAM cache memory.
Prior art high-performance second level SRAM cache memory devices generally conform to a set of pin and function specifications to assure that system logic controller 110 may operate compatibly with a variety of different SRAM cache memories from multiple suppliers. Several examples of such pin and function specifications are set forth in the following references: "Pentium™ Processor 3.3V Pipelined BSRAM Specification", Version 1.2, Intel Corporation, Oct. 5, 1994; "32K×32 CacheRAM™ Pipelined/Flow Through Outputs Burst Counter, and Self-Timed Write - For Pentium™/PowerPC™ Processors", Advance Information IDT71V432, Integrated Device Technology, Inc., May 1994; and "32K×32 CacheRAM™ Burst Counter and Self-Timed Write - For the Pentium™ Processor", Preliminary IDT71420, Integrated Device Technology, Inc., May 1994.
It is therefore desirable to have a method and structure which enables DRAM memory to be used as a second level cache memory which can be interfaced to a conventional logic controller which normally controls a second level SRAM cache memory. It is further desirable to have such a method and structure which requires minimal modification to the conventional logic controller.
In accordance with the present invention, a structure and method for configuring a DRAM array, or a plurality of DRAM arrays, as a second level cache memory is provided. A structure in accordance with the invention includes a computer system having a central processing unit (CPU), an SRAM cache memory integrated with the CPU, a CPU bus coupled to the CPU, and a second level cache memory comprising a DRAM array coupled to the CPU bus. The second level cache memory is configured as stand alone memory in one embodiment. In another embodiment, the second level cache memory is configured and integrated with system logic on a monolithic integrated circuit (IC). For high pin count microprocessors such as Intel's Pentium, the companion system logic controller may be partitioned into multiple chips (e.g., Intel's 82430 PCI set). In such a system, the second level cache DRAM array of the present invention may be integrated with one of the system logic chips, preferably the system logic chip(s) for the data path. In another configuration, the second level cache memory can be integrated with the CPU itself.
When accessing the DRAM array of the present invention, row access and column decoding operations are performed in a self-timed asynchronous manner. Predetermined sequences of column select operations are then performed, wherein the column select operations are synchronous with respect to a clock signal. This asynchronous-synchronous accessing scheme reduces the access latency of the DRAM array.
In one embodiment, the DRAM array is operated in a dual-edge transfer mode in response to the CPU bus clock signal. Consequently, the DRAM array performs access operations at a frequency which is twice as fast as the frequency of the CPU bus clock signal. DRAM access therefore occurs twice as fast as operations on the CPU bus.
In another embodiment, the second level cache memory includes a phase locked loop (PLL) circuit coupled to the CPU bus. The PLL circuit generates a fast clock signal having a frequency greater than the frequency of a CPU bus clock signal. The fast clock signal is provided to the DRAM array to control read and write operations. In one embodiment, the fast clock signal has a frequency equal to twice the frequency of the CPU bus clock signal. Again, DRAM access occurs twice as fast as the operations on the CPU bus.
In yet another embodiment, the second level cache memory includes a phase locked loop (PLL) circuit coupled to the CPU bus. The PLL circuit generates buffered clock signals at the same frequency as the CPU bus clock signal and may have various phase relationships with respect to the CPU bus clock signal.
Data values can be read from the DRAM array to the CPU bus through a read first in first out (data buffer) memory having a data input port coupled to the DRAM array and a data output port coupled to the CPU bus. The data input port is clocked by the fast clock signal and the data output port is clocked by the CPU bus clock signal. Because data is read out of the DRAM array faster than the data is read out to the CPU bus, additional time is available during which the DRAM array can be precharged. The precharge time is thereby "hidden" from the CPU bus during a read operation from the second level cache memory. Alternatively, the width of the data input port between the DRAM array and the read data buffer can be widened, and the data input port can be clocked by a buffered version of the CPU bus clock signal. This alternative also provides a faster internal data transfer rate between the DRAM array and the read data buffer, thereby providing additional time in which the DRAM array can be precharged.
Data values can also be written from the CPU bus to the DRAM array through a write data buffer memory having a data output port coupled to the DRAM array and a data input port coupled to the CPU bus. The output port of the write data buffer memory is clocked by the fast clock signal and the input port of the write data buffer memory is clocked by the CPU bus clock signal. A first set of data values is written and stored in the write data buffer memory until a second set of data values is written to the write data buffer memory. At this time, the first set of data values is written to the DRAM array at the frequency of the fast clock signal. Because the first set of data values is written to the DRAM array faster than the second set of data values is written to the write data buffer memory, a DRAM precharge operation can be performed during the time the second set of data values is written to the write data buffer memory. Therefore, the DRAM precharge operation is effectively "hidden" from the CPU bus during a write operation to the second level cache memory. Alternatively, the width of the data output port between the write data buffer memory and the DRAM array can be widened, and the data output port can be clocked by a buffered version of the CPU bus clock signal. This alternative also provides a faster internal data transfer rate between the write data buffer memory and the DRAM array, thereby providing additional time in which the DRAM array can be precharged.
By operating the DRAM array with a faster clock signal or a wider data path than the CPU bus, a DRAM memory array can be used to satisfy the speed and operational requirements of a second level cache memory. Such a DRAM memory array can be used at a lower cost, typically 75% less, than traditional SRAM implementations.
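The precharge-hiding argument made for the read and write data buffers can be checked with simple arithmetic. The sketch below assumes a 50 MHz CPU bus clock, a burst of 4 data values and an internal transfer rate twice that of the bus (whether obtained from a faster clock or from a data path twice as wide); these numbers and the function name are illustrative and are not taken from the specification.

```python
def hidden_precharge_ns(bus_mhz, burst_len=4, speedup=2):
    """Time left over for DRAM precharge while the bus finishes a burst.

    bus_mhz   -- CPU bus clock frequency in MHz
    burst_len -- number of data values in the burst
    speedup   -- ratio of internal DRAM transfer rate to bus transfer rate
    """
    bus_period_ns = 1000.0 / bus_mhz
    bus_time_ns = burst_len * bus_period_ns               # time on the CPU bus
    array_time_ns = burst_len * bus_period_ns / speedup   # time at the array
    return bus_time_ns - array_time_ns


# 4 values take 80 ns on a 50 MHz bus but only 40 ns at the DRAM array,
# leaving 40 ns in which the array can be precharged.
print(hidden_precharge_ns(50))  # 40.0
```

With a speedup of 1 the result is zero, which corresponds to the case in which precharge cannot be hidden behind the bus transfer.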
In another embodiment, data values to and from the DRAM array are routed through a sense amplifier circuit, a data amplifier circuit and a column selector coupled between the sense amplifier circuit and the data amplifier circuit. Writing data values to the DRAM array then involves the steps of (1) opening the column selector to isolate the data amplifier circuit from the sense amplifier circuit, (2) writing the data values from the write data buffer memory to the data amplifier circuit substantially in parallel with performing a row access operation in the DRAM array, and (3) closing the column selector to connect the data amplifier circuit to the sense amplifier circuit, thereby causing the data values to be provided to the DRAM array through the sense amplifier circuit. By writing data values to the write data buffer memory in parallel with the row access operation, more time is available to precharge the DRAM array.
The column selector can also be used during a DRAM read operation to provide additional time for a DRAM precharge operation. To do this, data values are read from the DRAM array to the sense amplifier circuit. The column selector is then closed to connect the sense amplifier circuit to the data amplifier circuit. After the data values have been written to the data amplifier circuit, the column selector is opened, thereby isolating the sense amplifier circuit from the data amplifier circuit. The data values can then be read out of the data amplifiers while the DRAM array is being precharged.
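The ordering of the read steps described above can be traced with a small behavioral model. It is a sketch only: the class, its attributes and the logged step names are hypothetical, and the model captures sequencing rather than circuit timing.

```python
class ColumnSelectorModel:
    """Tracks where read data lives as the column selector closes and opens."""

    def __init__(self):
        self.sense_amps = None        # data held in the sense amplifier circuit
        self.data_amps = None         # data held in the data amplifier circuit
        self.selector_closed = False  # closed = circuits connected
        self.log = []

    def read(self, row_data):
        # Step 1: row access places the data in the sense amplifiers.
        self.sense_amps = list(row_data)
        self.log.append("row access")
        # Step 2: close the selector and copy the data to the data amplifiers.
        self.selector_closed = True
        self.data_amps = list(self.sense_amps)
        self.log.append("selector closed, data copied")
        # Step 3: open the selector to isolate the two circuits.
        self.selector_closed = False
        self.log.append("selector opened")
        # Step 4: the array is precharged while the data amplifiers
        # still hold the values to be read out.
        self.sense_amps = None
        self.log.append("precharge during data output")
        return self.data_amps


model = ColumnSelectorModel()
print(model.read([10, 11, 12, 13]))  # [10, 11, 12, 13]
```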
The DRAM cache memory of the present invention operates on a transaction by transaction basis. A transaction is defined as a complete read or write data access cycle for a given address. A transaction can involve the transfer of a single data value, or the burst transfer of 4 data values. A burst transfer can transfer the data values on consecutive clock cycles, every other clock cycle, every third clock cycle, etc. A transaction in the DRAM cache memory must be executed as either a read or a write transaction, but cannot be both. That is, the DRAM cache memory transaction cannot include partial read and partial write transactions, or change from a read transaction into a write transaction before the data transfer begins. In contrast, in standard SRAM, Burst SRAM (BSRAM) or Pipelined Burst SRAM (PBSRAM) memories, a transaction can start as either a read or a write and change into write or read on a clock by clock basis. This is because SRAM accesses, whether with or without input registers or output registers, are directly from and to the memory cell array and the read or write operation can be applied to the memory cells directly.
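The transaction rule stated above, that the direction is fixed for the whole transaction, can be expressed as a small sketch. The `Transaction` class and its method are hypothetical names used only for illustration; the lengths of 1 and 4 follow the single and burst transfers mentioned in the text.

```python
class Transaction:
    """One complete read or write access cycle for a given address."""

    def __init__(self, address, is_write, length=1):
        if length not in (1, 4):
            raise ValueError("length must be 1 (single) or 4 (burst)")
        self.address = address
        self.is_write = is_write  # fixed when the transaction starts
        self.length = length
        self.transferred = 0

    def transfer(self, is_write):
        """Move one data value; returns True when the transaction is complete."""
        if is_write != self.is_write:
            raise ValueError("direction cannot change within a transaction")
        if self.transferred >= self.length:
            raise ValueError("transaction already complete")
        self.transferred += 1
        return self.transferred == self.length


burst_read = Transaction(0x100, is_write=False, length=4)
print([burst_read.transfer(False) for _ in range(4)])
# [False, False, False, True]
```

An SRAM-style access that switches between read and write on a clock by clock basis would raise an error in this model, which is the behavior the text contrasts against.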
The transaction-based configuration of the DRAM cache memory of the present invention utilizes control signals to prevent any incorrect or delayed internal operations which might otherwise occur due to the internal two-step access (RAS and CAS) of the DRAM cache memory and the write data buffer used to buffer the write operation. In a preferred embodiment, a CPU-initiated address strobe input signal (ADSP#) and a controller-initiated address strobe input signal (ADSC#) are used to indicate the start of new transactions in a manner compatible with standard PBSRAM. A byte write enable input signal (BWE#) and a global write input signal (GW#) are used as write control signals in a manner compatible with standard PBSRAM. An additional W/R# input signal (which is typically driven by the CPU) is incorporated to enable read and write transactions of the DRAM cache memory to be performed in a well-defined manner.
The DRAM array, unlike the SRAM array, also requires periodic refresh operations to restore the charge in the cell capacitors to guarantee data integrity. To manage the internal refresh operation of the DRAM array without disrupting normal CPU and system controller operations, a handshake (Krdy) signal is required to communicate between the DRAM cache memory and the system controller, so that the latter may delay its own operation and operation of the CPU while the DRAM array is being refreshed. In a preferred embodiment, one signal pin of the DRAM array is used to carry the handshake signal. The single pin maintains maximum compatibility with standard PBSRAM system controllers.
In one embodiment, the falling edge of the Krdy signal indicates there is a pending refresh or other internal operation request, and the rising edge of the Krdy signal indicates the refresh or other internal operation has been completed. The polarity of the Krdy signal is chosen arbitrarily, and opposite polarity can be used to accomplish the same effect. Both the DRAM cache memory and the system controller sample the Krdy signal at least at the beginning of each new transaction, whether the transaction is initiated by the ADSP# or ADSC# signal.
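The edge convention for the Krdy signal can be sketched as follows, assuming the active-low polarity described above and assuming that the system controller holds off new transactions while Krdy is low. The function names and event labels are illustrative and are not part of the specification.

```python
def krdy_events(samples):
    """Translate successive Krdy samples (1 = high, 0 = low) into events."""
    events = []
    for previous, current in zip(samples, samples[1:]):
        if previous == 1 and current == 0:
            events.append("refresh_pending")    # falling edge
        elif previous == 0 and current == 1:
            events.append("refresh_complete")   # rising edge
    return events


def may_start_transaction(krdy_level):
    """Assumed controller policy: start new transactions only while Krdy is high."""
    return krdy_level == 1


print(krdy_events([1, 1, 0, 0, 1]))
# ['refresh_pending', 'refresh_complete']
```

Choosing the opposite polarity would only swap the two comparisons in `krdy_events`.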
The Krdy signal can be used in different manners. In a preferred embodiment, the Krdy signal is implemented as an input/output signal. When multiple DRAM cache memory devices are used together for memory width or depth expansion or both, the Krdy signal can be used for synchronizing the DRAM refresh and/or internal operation among the multiple devices. Specifically, one of the DRAM cache memory devices is designated as a master device for refresh management. This master DRAM cache memory device uses the Krdy signal to communicate with the system controller and control the refresh management function. Each of the remaining DRAM cache memory devices share the Krdy signal line and are designated as slave devices. Each slave device samples the state of the Krdy signal to control its own refresh or internal operation as appropriate.
In an alternative embodiment, the Krdy signal is driven by the system controller, and each DRAM cache memory, upon detecting a low Krdy signal, will initiate and complete a pre-defined refresh operation.