In conventional non-pipelined dynamic random access memories (DRAMs) a data transfer to and from the memory is performed in sequence. That is, when a read or a write command is received and an address is made available, the data transfer according to either a read or write command is performed in its entirety before another command is accepted by the memory. This results in subsequent commands being delayed by the time it takes for the current data transfer to complete.
Historically, DRAMs have been controlled asynchronously by the processor. This means that the processor puts addresses on the DRAM inputs and strobes them in using the row address select signal (RAS) and column address select signal (CAS) pins. The addresses are held for a required minimum length of time. During this time, the DRAM accesses the addressed locations in memory and after a maximum delay (access time) either writes new data from the processor into its memory or provides data from the memory to its outputs for the processor to read.
During this time, the processor must wait for the DRAM to perform various internal functions such as precharging of the lines, decoding the addresses and the like. This creates a “wait state” during which the higher speed processor is waiting for the DRAM to respond, thereby slowing down the entire system.
One solution to this problem is to make the memory circuit synchronous, that is, to add input and output latches on the DRAM which can hold the data. Input latches can store the addresses, data, and control signals on the inputs of the DRAM, freeing the processor for other tasks. After a preset number of clock cycles, the data can be available on the output latches of a DRAM with synchronous control for a read, or be written into its memory for a write operation.
Synchronous control means that the DRAM latches information transferred between the processor and itself under the control of the system clock. Thus, an advantage of synchronous DRAMs is that the system clock is the only timing edge that must be provided to the memory. This reduces or eliminates propagating multiple timing strobes around the printed circuit board.
Alternatively, the DRAM may be made asynchronous. For example, suppose a DRAM with a 60 ns delay from row addressing to data access is being used in a system with a 10 ns clock. The processor must then apply the row address and hold it active while strobing it in with the (RAS) pin. This is followed 30 ns later by the column address, which must be held valid and strobed in with the (CAS) pin. The processor must then wait for the data to appear on the outputs 30 ns later, stabilize, and be read.
On the other hand, for a synchronous interface, the processor can lock the row and column addresses (and control signals) into the input latches and do other tasks while waiting for the DRAM to perform the read operation under the control of the system clock. When the outputs of the DRAM are clocked six cycles (60 ns) later, the desired data is in the output latches.
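The cycle counts in this example follow directly from the stated delays. The short sketch below only restates that arithmetic; the function name `cycles_for` is hypothetical, and the rounding up to whole clock cycles is an assumption that does not matter for the figures given here, since they divide evenly.

```python
def cycles_for(delay_ns: int, clock_period_ns: int) -> int:
    """Whole clock cycles needed to cover a delay (rounded up)."""
    # Integer ceiling division; rounding up is an assumption for delays
    # that are not an exact multiple of the clock period.
    return -(-delay_ns // clock_period_ns)


CLOCK_PERIOD_NS = 10    # system clock from the example
ROW_TO_DATA_NS = 60     # delay from row addressing to data access
ROW_TO_COLUMN_NS = 30   # column address follows the row address by 30 ns

print(cycles_for(ROW_TO_DATA_NS, CLOCK_PERIOD_NS))    # 6
print(cycles_for(ROW_TO_COLUMN_NS, CLOCK_PERIOD_NS))  # 3
```

In the asynchronous case the processor is occupied for the whole 60 ns; in the synchronous case the same six cycles elapse, but the processor is free after the addresses are latched.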
A synchronous DRAM architecture also makes it possible to speed up the average access time of the DRAM by pipelining the addresses. In this case, it is possible to use the input latch to store the next address from the processor while the DRAM is operating on the previous address. Normally, the addresses to be accessed are known several cycles in advance by the processor. Therefore, the processor can send the second address to the input address latch of the DRAM to be available as soon as the first address has moved on to the next stage of processing in the DRAM. This eliminates the need for the processor to wait a full access cycle before starting the next access to the DRAM.
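The benefit of address pipelining can be illustrated with a simple cycle count. This is a rough sketch under stated assumptions, not a model of any particular device: it assumes every access takes the same number of cycles, and that in the pipelined case a new address can be accepted on every clock. The function names are hypothetical.

```python
def non_pipelined_cycles(num_accesses: int, access_cycles: int) -> int:
    """Each access must finish before the next one starts."""
    return num_accesses * access_cycles


def pipelined_cycles(num_accesses: int, access_cycles: int) -> int:
    """First access pays the full latency; assuming one new address per
    clock, each later access completes one cycle after the previous one."""
    if num_accesses == 0:
        return 0
    return access_cycles + (num_accesses - 1)


# Four accesses with a six-cycle access time (six cycles as in the
# 60 ns / 10 ns example; the count of four accesses is arbitrary).
print(non_pipelined_cycles(4, 6))  # 24
print(pipelined_cycles(4, 6))      # 9
```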
An example of a three stage column address pipeline is shown in the schematic diagram of FIG. 1(a). The column address-to-output part is a three stage pipeline. The address buffer is the first latch. The column switch is the second latch and the output buffer is the third latch. The latency inherent in the column access time is therefore divided up between these three stages.
The operation of a pipelined read may be explained as follows: the column address (A1) is clocked into the address buffer on one clock cycle and is decoded. On the second clock cycle, the column switch transfers the corresponding data (D1) from the sense amplifier to the read bus and column address (A2) is clocked into the address buffer. On clock three, the data (D1) is clocked into the output buffer, (D2) is transferred to the read bus and (A3) is clocked into the column address buffer. When D1 appears at the output, D2 and D3 are in the pipeline behind it. For a more detailed discussion of the present technology, the reader is referred to a book entitled “High Performance Memories” by Betty Prince.
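The clock-by-clock behavior described above can be traced with a small simulation. This is a minimal sketch, assuming all three latches update on the same clock edge and that the data for address `An` can be represented by the label `Dn`; the function name `simulate_read_pipeline` and the string labels are illustrative only.

```python
def simulate_read_pipeline(addresses):
    """Trace a three-stage column read pipeline.

    Returns a list of (clock, address_buffer, read_bus, output_buffer)
    tuples, one per clock, until the pipeline has drained.
    """
    address_buffer = read_bus = output_buffer = None
    pending = list(addresses)
    trace = []
    clock = 0
    while pending or address_buffer is not None or read_bus is not None:
        clock += 1
        # All stages advance together, each taking the previous stage's
        # old value, so update from the output end backwards.
        output_buffer = read_bus
        # Stand-in for the column switch: data for "An" is labeled "Dn".
        read_bus = "D" + address_buffer[1:] if address_buffer else None
        address_buffer = pending.pop(0) if pending else None
        trace.append((clock, address_buffer, read_bus, output_buffer))
    return trace


for row in simulate_read_pipeline(["A1", "A2", "A3"]):
    print(row)
# (1, 'A1', None, None)
# (2, 'A2', 'D1', None)
# (3, 'A3', 'D2', 'D1')
# (4, None, 'D3', 'D2')
# (5, None, None, 'D3')
```

On clock three the trace shows D1 in the output buffer, D2 on the read bus, and A3 in the address buffer, matching the description.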
The delay, in number of clock cycles, between the latching of CAS in an SDRAM and the availability of data on the data bus is the “CAS latency” of the SDRAM. If the output data is available by the second leading edge of the clock following arrival of a column address, the device is described as having a CAS latency of two. Similarly, if the data is available at the third leading edge of the clock following the arrival of the first read command, the device is known as having a “CAS latency” of three.
Synchronous DRAMs (SDRAMs) come with programmable CAS latencies. As described above, the CAS latency determines at which clock edge data will be available after a read command is initiated, regardless of the clock rate (CLK). The programmable CAS latencies enable SDRAMs to be efficiently utilized in different memory systems having different system clock frequencies without affecting the CAS latency.
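One way to see why a programmable CAS latency is useful is to pick, for a given clock period, the smallest latency setting whose total time covers the column access time of the memory core. The sketch below is a simplified illustration: the 30 ns column access time, the supported settings of 1, 2 and 3, and the function name `pick_cas_latency` are all assumptions, and real devices specify their timing constraints in more detail.

```python
def pick_cas_latency(column_access_ns, clock_period_ns, supported=(1, 2, 3)):
    """Smallest supported CAS latency whose duration covers the access time.

    Returns None if no supported setting is long enough at this clock.
    """
    for cas_latency in sorted(supported):
        if cas_latency * clock_period_ns >= column_access_ns:
            return cas_latency
    return None


# Assumed 30 ns column access time at three different clock periods.
print(pick_cas_latency(30, 30))  # 1
print(pick_cas_latency(30, 15))  # 2
print(pick_cas_latency(30, 10))  # 3
```

A faster clock needs a higher CAS latency setting to cover the same core access time, which is why the same part can serve systems with different clock frequencies.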
There are other ways to divide an SDRAM data path into latency stages. A wave pipeline is shown schematically in FIG. 1(b). A regular clocked pipeline has the disadvantage that the read latency will be equal to the delay of the slowest pipeline stage (i.e. longest delay) multiplied by the number of pipeline stages. A clocked pipeline with adjusted clocks uses clock signals that have been adjusted to each pipeline stage so that longer pipeline stages may be accommodated without impacting the read latency. A longer pipeline stage will be ended with a clock that is more delayed than the clock that starts the pipeline stage. A shorter pipeline stage will be started with a clock that is more delayed than the clock that ends the pipeline stage. A disadvantage of this scheme is that different adjustments to the clock are needed for each CAS latency supported by the chip. Also, architecture changes can have a large impact on the breakdown of the latency stages, requiring designers to readjust all the clocks to accommodate the new division of latency stages.
Furthermore, there are a limited number of places where a latency stage can be inserted without adding extra latency or chip area. Multiple latency stages have a disadvantage in that not all latency stages will be equal in the time needed for signals to propagate through the stage. Another complication is the need to enable or disable latency stages depending on the CAS latency at which the chip has been programmed to operate.
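The cost of unequal stages in a regular clocked pipeline can be shown with a small calculation. The stage delays below are hypothetical. The first function applies the rule stated above (slowest stage multiplied by the number of stages); the second gives the sum of the stage delays, which is used here only as an idealized reference for what perfectly adjusted clocks might approach, and is an assumption rather than a figure from the text.

```python
def regular_pipeline_latency(stage_delays_ns):
    """Read latency when every stage is given the slowest stage's time."""
    return max(stage_delays_ns) * len(stage_delays_ns)


def ideal_adjusted_latency(stage_delays_ns):
    """Idealized reference: each stage takes only its own delay.

    Assumption for illustration; practical adjusted clocks add margins.
    """
    return sum(stage_delays_ns)


stage_delays_ns = [8, 12, 10]  # hypothetical three-stage data path

print(regular_pipeline_latency(stage_delays_ns))  # 36
print(ideal_adjusted_latency(stage_delays_ns))    # 30
```

The 6 ns gap is time lost because the two shorter stages are each held to the 12 ns of the slowest stage.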
The wave pipeline of FIG. 1(b) runs pulses of data through the entire read data path. A wave pipeline relies on an ideal data path length, that is, it assumes that all data paths are equal. However, data retrieval from certain memory cells in a memory array will be inherently faster than data retrieval from other memory cells. This is primarily due to the physical location of the memory cells relative to both the read in and read out data path. Thus data must be resynchronized before being output from the chip. This data path skew makes it difficult to safely resynchronize the retrieved data in a wave pipeline implementation.
If address signals are applied to a data path with a cycle time which exceeds the memory access time, then the data which is read from the memory is not output during the inherent delay of the memory core. In other words, in the wave pipeline technique, address input signals are applied with a period that is less than the critical path of the memory core section.
Furthermore, as illustrated in FIGS. 2(a) and 2(b), with a slow clock it is necessary to store the output data of the wave pipeline until the data is needed.