The evolution of the dynamic random access memories used in computer systems has been driven by ever-increasing speed requirements mainly dictated by the microprocessor industry. Dynamic random access memories (DRAMs) have generally been the predominant memories used for computers due to their optimized storage capabilities. This large storage capability comes at the price of slower access time and the requirement for more complicated interaction between memories and microprocessors/microcontrollers than in the case of, say, static random access memories (SRAMs) or non-volatile memories.
In an attempt to address this speed deficiency, various major improvements have been implemented in DRAM design, all of which are well documented. DRAM designs evolved from Fast Page Mode (FPM) DRAM to Extended Data Out (EDO) DRAM and then to synchronous DRAM (SDRAM). Further speed increases have been achieved with Double Data Rate (DDR) SDRAM, which synchronizes data transfers on both clock edges. However, as the speed requirements from the microprocessor industry continue to move ahead, new types of memory interfaces have had to be contemplated to address the still existing vast discrepancy in speed between the DRAMs and microprocessors.
Recently, a number of novel memory interface solutions aimed at addressing the speed discrepancy between memory and microprocessors have been presented.
Several generations of high bandwidth DRAM-type memory devices have been introduced. Of note is Rambus Inc., which first introduced a memory subsystem in which data and command/control information are multiplexed on a single bus, as described in U.S. Pat. No. 5,319,755, which issued Jun. 7, 1994. Subsequently, Concurrent Rambus™ was introduced, which altered the command/data timing but retained the same basic bus topology. Finally, Direct Rambus™, described in R. Crisp, "Direct Rambus Technology: The New Main Memory Standard", IEEE Micro, November/December 1997, pp. 18-28, was introduced, in which command and address information is separated from data information to improve bus utilization. Separate row and column command fields are provided to allow independent control of memory bank activation, deactivation, refresh, data read and data write (column) commands. All three Rambus variations, however, share the same bus topology, as illustrated in FIG. 1(a).
In this topology a controller 10 is located at one end of a shared bus 12, while a clock driver circuit 14 and bus terminations 16 are located at an opposite end. The shared bus includes data and address/control busses, which run from the controller at one end to the various memory devices MEMORY 1 . . . MEMORY N and the terminations at the far end. The clock signal generated by the clock driver 14 begins at the far end and travels towards the controller 10 and then loops back to the termination at the far end. The clock bus is twice as long as the data and address/control busses. Each memory device has two clock inputs, ClkToController and ClkFromController, one for the clock traveling towards the controller (cTc) and another for the clock traveling away from the controller (cFc) towards the termination. When the controller 10 reads from a memory device, the memory device synchronizes the data it drives onto the bus with the clock traveling towards the controller. When the controller is writing to a memory device, the memory device uses the clock traveling away from the controller to latch in data. In this way the data travels in the same direction as the clock, and clock-to-data skew is reduced. The memory devices employ on-chip phase locked loops (PLL) or delay locked loops (DLL) to generate the correct clock phases to drive data output buffers and to sample the data and command/address input buffers.
There are a number of shortcomings with this topology, as will be described below.
For the bus topology of FIG. 1(a) the clock frequency is 400 MHz. FIG. 1(b) shows the timing of control and data bursts on the bus 12. Since data is transmitted or received on both edges of the clock, the effective data rate is 800 Mb/s. A row command ROW burst consists of eight (8) consecutive words, beginning on a falling edge of the clock from the controller cFc and applied on the three (3) bit row bus. A column command COL consists of eight (8) consecutive words transmitted on the five (5) bit column bus. Independent row and column commands can be issued to the same or different memory devices by specifying appropriate device identifiers within the respective commands. At the controller 10 the phases of the two clock inputs, cFc and cTc, are close together. There is a delay to the memory chip receiving the commands due to finite bus propagation time, shown in FIG. 1 as approximately 1.5 bit intervals or 1.875 ns. The clock signal cFc propagates with the ROW and COL commands to maintain phase at the memory inputs. Read data resulting from a previous COL command is output as a burst of eight (8) consecutive 16 or 18 bit words on the data bus, starting on a falling edge of cTc. The data packet takes roughly the same amount of time to propagate back to the controller, about 1.5 bit intervals. The controller spaces COL command packets to avoid collisions on the databus. Memory devices are programmed to respond to commands with fixed latency. A WRITE burst is driven to the databus two bit intervals after the end of the READ burst. Because of the finite bus propagation time, the spacing between READ and WRITE bursts is enlarged at the memory inputs. Likewise, the spacing between a WRITE and READ burst would be smaller at the memory device than at the controller.
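The timing figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the constants are taken from the description, while the function names are hypothetical and the READ-to-WRITE calculation assumes the memory device in question is the one shown with a 1.5 bit interval propagation delay.

```python
# Illustrative arithmetic for the prior art bus timing described above.
# Constants come from the text; all function names are hypothetical.

CLOCK_HZ = 400e6             # bus clock frequency
EDGES_PER_CYCLE = 2          # data is transferred on both clock edges
BURST_WORDS = 8              # words per ROW, COL or data burst
PROP_BITS = 1.5              # one-way bus propagation delay, in bit intervals
GAP_AT_CONTROLLER_BITS = 2   # READ-to-WRITE spacing seen at the controller


def bit_interval_ns():
    """Duration of one bit interval at 800 Mb/s per pin."""
    return 1e9 / (CLOCK_HZ * EDGES_PER_CYCLE)


def burst_duration_ns(words=BURST_WORDS):
    """Time occupied on the bus by one burst."""
    return words * bit_interval_ns()


def propagation_delay_ns(prop_bits=PROP_BITS):
    """One-way propagation delay between controller and memory device."""
    return prop_bits * bit_interval_ns()


def read_to_write_gap_at_memory_bits(gap=GAP_AT_CONTROLLER_BITS, prop=PROP_BITS):
    """READ-to-WRITE spacing as seen at the memory device.

    The READ burst left the memory `prop` bit intervals before the controller
    saw it, and the WRITE burst reaches the memory `prop` bit intervals after
    the controller drove it, so the gap grows by twice the propagation delay.
    """
    return gap + 2 * prop


def peak_bandwidth_bytes_per_s(data_bits=16):
    """Peak data bus bandwidth for a 16 bit (or 18 bit) wide data bus."""
    return data_bits * CLOCK_HZ * EDGES_PER_CYCLE / 8
```

Under these assumptions the bit interval is 1.25 ns, a burst of eight words lasts 10 ns, and a two bit interval gap at the controller widens to five bit intervals at the memory device.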
For example, there is a summation of clock-to-data timing errors in transferring data from one device to another. FIG. 2(a) is a schematic diagram of the loop-back clock, data lines and clock synchronization circuit configuration. In this configuration, the bus clock driver 14 at one end of the ClockToController line 22 of the clock bus propagates an early bus clock signal in one direction along the bus, for example from the clock 14 to the controller 10. The same clock signal then is passed through the direct connection shown to a second line 24 of the bus and loops back, as a late ClockFromController, along the bus where it terminates with resistance Rterm. Thus, each memory device 26 receives the two bus clock signals at different times. The memory device 26 includes a clock and data synchronization circuit for sampling the two bus clocks cFc and cTc and generating its own internal transmit and receive clocks TX_clk and RX_clk respectively, for clocking transmit and receive data to and from the databus respectively. The bus clock signals cFc and cTc are fed via respective input receiver comparators 11 and 20 into corresponding PLL/DLL circuits 40 and 50. For the input of data from the controller to a memory device, the role of the on-chip PLL/DLL circuit 40 is to derive from the cFc clock input internal clocks to sample control, address, and data to be written to the memory on (positive 90° and negative 270°) edges of the clock, at the optimum point in the data eye. These internal receive data clocks may also be used to drive the internal DRAM core 32. For the output of data from the memory device 26 to the controller 10, the role of the on-chip PLL/DLL circuit 50 is to derive from the cTc clock input internal transmit data clocks (0° and 180°) to align transmitted data (read data from the memory core) with the edges of the external clock.
The data I/O pin has an output transistor 27 for driving the data bus. An actual memory device will have 16 or 18 such data pins. The other data pins are not shown in FIG. 2(a) for simplicity. During times when the device is not driving read data onto the databus the gate of output transistor 27 is held at logic 0 by OE being logic 0, so as not to interfere with write data or read data from another device which may appear on the bus.
Row control and column control input pins are also shown in FIG. 2(a) and it is understood that address signals are also received via the data bus. They have a structure identical to the data I/O pin, except that the gate of the output transistor 27′ is tied to logic 0, since output drive is never required. The disabled output transistor 27′ matches the capacitive load presented to the external bus to that of a data I/O pin, so that signal propagation characteristics are identical for all inputs, address row control, column control, and data. The two clock inputs have similar dummy output transistors 28 and 29, to equalize loading.
In the prior art system, Vterm is equal to 1.8 V, Rterm is 20 Ω, and the current Iout provided by the device driving the bus is 40 mA. This is shown schematically in FIG. 2(b). In this configuration, a high level signal is equal to the bus termination voltage, Vterm (1.8 V), and a low level signal is equal to Vterm - Iout*Rterm (1.0 V). Power consumed while the signal is pulled low is 72 mW, of which 40 mW is dissipated on chip, and 32 mW in the termination. Assuming an equal probability of high and low data, the average power dissipation would be 36 mW, of which 20 mW is dissipated on chip and 16 mW in the termination.
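The signal levels and power figures for the open drain configuration follow directly from Ohm's law. The short Python sketch below reproduces them from the stated termination voltage, termination resistance and drive current; the function names are hypothetical and are introduced only for illustration.

```python
# Illustrative check of the open drain bus levels and power figures.
# Component values come from the text; function names are hypothetical.

V_TERM = 1.8    # termination voltage, volts
R_TERM = 20.0   # termination resistance, ohms
I_OUT = 0.040   # drive current of the device pulling the bus low, amperes


def bus_levels(v_term=V_TERM, r_term=R_TERM, i_out=I_OUT):
    """Return (high level, low level, midpoint) in volts."""
    v_high = v_term                   # bus floats to Vterm when undriven
    v_low = v_term - i_out * r_term   # drop across the termination resistor
    return v_high, v_low, (v_high + v_low) / 2


def low_state_power_mw(v_term=V_TERM, r_term=R_TERM, i_out=I_OUT):
    """Return (total, on-chip, termination) power in mW while driving low."""
    total = v_term * i_out            # power drawn from the Vterm supply
    termination = i_out ** 2 * r_term # dissipated in the termination resistor
    on_chip = total - termination     # remainder is dissipated in the driver
    return total * 1e3, on_chip * 1e3, termination * 1e3


def average_power_mw(p_low=0.5):
    """Average power per signal, assuming probability p_low of a low bit."""
    return tuple(p_low * p for p in low_state_power_mw())
```

With the stated values this gives levels of 1.8 V and 1.0 V with a midpoint of 1.4 V, 72 mW while driving low (40 mW on chip, 32 mW in the termination), and half of each figure on average for equiprobable data.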
Given the high and low voltage range, the reference voltage for the comparator is set to 1.4 V, which is midway between high and low levels on the bus. The input timing waveforms for this circuit configuration are shown in FIG. 3. The cFc signal is delayed from the pin through the input comparator 11. The rising edges of the clock cFc signal are shown as a shaded area 134 on the timing diagram because of the differences between the generation of rising and falling edges. Falling edges are more accurate since they are generated by on chip drivers and are calibrated to produce the desired low level signal on the bus. On the other hand, the rising edges are created by the bus termination pullup resistor and will have different edge characteristics depending on the distance from the termination, number of loads on the bus, etc. Because of the differences in rising and falling edges, the received clock and data signals may not have precise 50/50 duty cycles. The DLL/PLL block 40 responds only to the falling edge of the clock input, since it is the most accurate edge. The DLL/PLL generates four outputs at 0°, 90°, 180° and 270°. These outputs are phase locked to the data input. The DLL/PLL shifts the free running clock input to align the 0° and 180° outputs to input data edge transitions. The 90° and 270° outputs can then be used to sample input data in odd and even latches corresponding to data generated on rising and falling edges of the clock respectively.
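To make the relationship between the four DLL/PLL phases and the bit interval concrete, the following Python sketch converts each phase to a time offset within one clock period. It assumes the 400 MHz bus clock stated for the topology of FIG. 1(a) and an ideal 50/50 duty cycle; the function name is hypothetical.

```python
# Illustrative conversion of DLL/PLL output phases to time offsets,
# assuming the 400 MHz bus clock and an ideal 50/50 duty cycle.

CLOCK_HZ = 400e6


def phase_offsets_ns(clock_hz=CLOCK_HZ, phases_deg=(0, 90, 180, 270)):
    """Map each phase in degrees to its offset in ns within one period."""
    period_ns = 1e9 / clock_hz
    return {deg: period_ns * deg / 360 for deg in phases_deg}
```

Under these assumptions the 0° and 180° outputs fall on the data transitions at 0 ns and 1.25 ns, and the 90° and 270° outputs fall at 0.625 ns and 1.875 ns, that is, at the nominal center of each 1.25 ns bit interval.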
There will be some timing error Δr and Δf, on the rising edge and falling edges of DLL outputs respectively, with respect to the output of the clock comparator, as shown in FIG. 3. These timing errors may occur due to any one or a combination of static phase offset, timing jitter and wander resulting from inaccuracies and mismatches within the components making up the DLL/PLL loop. The 0° and 180° outputs will be aligned to the average transition points. Since the DLL/PLL outputs a 50/50 duty cycle signal, while the data inputs may have a degraded duty cycle due to the aforementioned asymmetrical drive problem, this results in a further error in positioning the clock for optimal data sampling. The timing errors between clock and data created at the transmitting device and the receiving device are cumulative and can result in data errors.
The output timing waveforms for the circuit of FIG. 2(a) are shown in FIG. 4. The DLL/PLL 50 shown in FIG. 2(a) takes the free running ClockToController and creates delayed versions of the free running clock. The DLL/PLL monitors transmit data (read data from the core memory) output to the databus via output driver transistor 27 through comparator 30 and adjusts the delay of the 0° and 180° clocks which drive the output latches 51 to align output data transitions to transitions of the ClockToController transmit clock. Due to the asymmetrical nature of the rising and falling edges appearing on the ClockToController bus, all outputs from the DLL/PLL 50 are generated from falling edges of the free running input ClockToController clock. The output data latching function is shown conceptually to include odd and even data latches and a multiplexer which alternates between the two data streams. The output data latch is followed by an AND gate which performs an output disable function, holding the gate of output driver transistor 27 at logic zero when data is not being read from the device. Similarly to the input data case, timing errors between clock and data are cumulative and can result in data errors.
Another shortcoming of the prior art implementation shown in FIG. 1(a) is the system's method for dealing with intersymbol interference. Data transitions do not always occur in the same position relative to clock edges due to a number of factors. The clock is a repetitive waveform with which there will be a constant delay from one rising edge to another or from one falling edge to another. Data transitions are dependent on the previous bits transmitted, particularly on a long bus whose propagation delay exceeds one bit period. This effect is known as intersymbol interference (ISI). The effect of different data histories creates data transitions at different times relative to the clock. Basing the input sampling time purely on a fixed phase of the input clock, as in the architecture of FIG. 1(a), will be suboptimal in the presence of ISI. Other effects such as crosstalk coupling between other wires near the signal in question, which can be either in phase or out of phase, and data dependent power supply coupling affecting both input buffers and output drivers, can also close the effective data eye, i.e., the window during which data can be successfully sampled.
A further shortcoming of the prior art is that open drain outputs, shown schematically in FIG. 2(b), are used to drive signals from a device onto the bus in the system of FIG. 1(a). Because the falling edge of the clock is created by a clock generator pull-down transistor (not shown), while the rising edge is created by the bus termination resistor, it is difficult to match pulse rise time and pulse fall time. This can lead to a non-symmetric duty cycle on the clock bus. To resolve this problem, the clock falling edge can be used as a timing reference and the clock rising edge can be re-synthesized internally with the DLL/PLL. However, this approach creates an internal sampling instant that is unrelated to the data edge of the bit being sampled, compounding the effects described above and resulting in further closure of the data eye, since subsequent data bits cannot be known in advance whereas a clock sequence is repetitive and therefore determinable in advance.
Each device discussed above in the prior art self-calibrates its output pulse amplitude levels. Either an external reference or an internally generated reference level is required, along with precision comparator circuits and calibration control circuitry. Inaccuracies in any of these elements may lead the output amplitudes from different devices to vary, resulting in further closure of the data eye.
The number of devices in the prior art configuration described above is limited to 32 because of the loading and length of the bus. With 64 Mb devices the total memory capacity is limited to 256 MB. If a larger memory configuration is required, the controller must support several busses in parallel, consuming additional pins, silicon area, and power.
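The capacity limit follows from the device count and device density. The Python sketch below works through that arithmetic and, as an illustration of the scaling problem, estimates how many parallel busses a controller would need for a larger configuration; the function names and the 1024 MB example are hypothetical.

```python
# Illustrative capacity arithmetic for the prior art bus.
# Device count and density come from the text; names are hypothetical.

MAX_DEVICES_PER_BUS = 32
DEVICE_MBITS = 64


def max_capacity_mbytes(devices=MAX_DEVICES_PER_BUS, device_mbits=DEVICE_MBITS):
    """Total capacity of one fully populated bus, in megabytes."""
    return devices * device_mbits // 8


def busses_needed(target_mbytes):
    """Number of parallel busses needed to reach a target capacity."""
    per_bus = max_capacity_mbytes()
    return -(-target_mbytes // per_bus)  # ceiling division on integers
```

For example, `max_capacity_mbytes()` gives 256, and a hypothetical 1024 MB configuration would need `busses_needed(1024)`, that is, four busses.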
The packaging technology for the prior art implementation described above is called Chip Scale Packaging or uBGA (micro Ball Grid Array). The intent of this packaging technology is to minimize the stub length from the connection to the bus to the on-chip input and output buffers. The length of the stub on the module is virtually eliminated by routing the bus through the module. Although stub length is reduced compared to standard packaging and module technology, there is still as much as 5 mm of stub within the uBGA package itself. This stub can cause reflections on the bus to the detriment of signal integrity.
Another shortcoming of the prior art approach is the requirement for a separate clock generator chip. Furthermore, there is twice as much load on the clock as on any other signal, and the clock line is twice as long. Ultimately, the maximum frequency at which the system can operate will be limited by the doubly loaded clock line. The pulse symbols in FIG. 5 show how systematic skew can develop between clock and write data at the far end of the bus. The clock reaching the controller cTc has already been attenuated by traveling the full length from the clock generator to the controller, and most of the higher order harmonics have been removed. At this point, cTc and cFc clocks should be identical and the controller synchronizes transitions of write data with zero crossings of the filtered cFc clock. The write data appears on the bus at this point with sharp edges and unattenuated amplitude. Because of the different frequency composition of the clock and write data, there is different group delay between clock and data at the far end of the bus. Since the cFc clock is somewhat attenuated already, further attenuation will not significantly affect its zero crossings. On the other hand, the write data, when attenuated, will lose its higher order harmonics which create the square wave form, resulting in a wave form as shown where the zero crossings have been significantly shifted. Therefore, transitions between clock and write data at the far end are skewed by an amount shown as tskew. As a result, write data sampling will not occur at the correct time.
Thus it may be seen that the prior art configuration described suffers from various disadvantages. The present invention seeks to mitigate at least some of these disadvantages.
Accordingly, one object of the present invention is to provide an improved high bandwidth chip-to-chip interface for memory devices, which is capable of operating at higher speeds, while maintaining error free data transmission, consuming lower power, and supporting more load.
Another object of the invention is to eliminate the requirement for a separate clock generator chip.
A further object of the invention is to provide a clock adjustment scheme to compensate for intersymbol interference, crosstalk, noise, and voltage and temperature drift in memory devices.
A still further object is to provide an improved bus topology in which clocks travel the same distance as data and do not limit overall bus performance.
A still even further object is to provide an improved packaging for these devices.
A still yet even further object is to provide a means to expand the number of memory devices that can be supported by a single controller.
In accordance with this invention, there is provided a memory subsystem comprising
a) at least two semiconductor devices;
b) a main bus containing a plurality of bus lines for carrying substantially all data and command information needed by the devices, the semiconductor devices including at least one memory device connected in parallel to the bus; the bus lines including respective row command lines and column command lines;
c) a clock generator for coupling to a clock line, the devices including clock inputs for coupling to the clock line; and the devices including programmable delay elements coupled to the clock inputs to delay the clock edges for setting an input data sampling time of the memory device.
According to a further aspect of the invention there is provided
a) a core memory;
b) a plurality of terminals for coupling to a bus including a free running clock and a data clock terminal and data I/O terminals;
c) a source synchronous clock generator for synchronising the output data clock with the output data in response to the free running clock.
According to one aspect of the invention the semiconductor devices include a clock offset fine adjustment for optimizing the sampling of received data, wherein the adjustment can be set during power up and periodically during operation by the controller to compensate for temperature and voltage drift.
A further aspect of the invention provides a memory subsystem including synchronous data clocks for source synchronous clocking, while the loopback clock is used to provide a free running clock to transmit data and to time the start of bursts to position consecutive data bursts appropriately in order to avoid overlap between consecutive bursts.
A further aspect of the invention provides a memory subsystem including means for calibrating the clock offset fine adjustment by utilizing a power up synchronization sequence. Preferably, the synchronization sequence is a bit sequence that includes a number of bit patterns such as a pseudorandom pattern to evaluate substantially all meaningful intersymbol interference histories in order to set an optimum time for a sampling instant.
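One way to picture such a calibration is sketched below in Python. The sketch is an assumption-laden illustration rather than the claimed implementation: it uses a 7 bit maximal-length shift register sequence (polynomial x^7 + x^6 + 1) as the pseudorandom pattern, and it picks the center of the widest run of error-free vernier settings as the sampling instant. The sequence length, the selection rule and all function names are hypothetical.

```python
# Illustrative sketch: a pseudorandom training pattern and a vernier sweep.
# The PRBS7 polynomial, the "center of widest passing window" rule and all
# names are assumptions for illustration, not taken from the description.


def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of a PRBS7 sequence (x^7 + x^6 + 1), period 127."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits


def distinct_windows(bits, width):
    """Return the set of distinct consecutive bit histories of a given width."""
    return {tuple(bits[i:i + width]) for i in range(len(bits) - width + 1)}


def best_sampling_offset(passes):
    """Pick a vernier step from per-step pass/fail results.

    passes[i] is True if the training pattern was received without error at
    vernier step i. Returns the center of the widest contiguous passing run,
    or None if no step passed.
    """
    best_start, best_len, start = None, 0, None
    for i, ok in enumerate(list(passes) + [False]):  # sentinel closes last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2
```

A sequence of 133 bits from `prbs7` contains every non-zero 7 bit history exactly once, so a receiver trained on it sees each of those data histories before the sampling offset is chosen. Longer histories would need a longer shift register.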
A further aspect of the invention provides a memory subsystem loopback clock architecture including a push pull I/O. This allows both rising and falling edges to be used for sampling data, thereby reducing the sensitivity of the system to clock duty cycle variation. This approach also saves power in the device itself allowing more cost-effective packaging.
A further aspect of the invention provides a memory subsystem wherein the semiconductor device includes a controller, which in turn includes means for calibrating the output high/output low voltage levels Voh/Vol of the memory devices by writing to registers in the memories to increment or decrement output levels and comparing the result on the bus to a reference voltage level local to the controller.
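The increment/decrement loop described above can be sketched as a simple search driven by a single comparator. The Python below is a minimal illustration under stated assumptions: `SimulatedDevice` is a hypothetical stand-in for a memory device with a linear level register, the 1800 mV, 50 mV and 1000 mV values are arbitrary illustrative numbers, and the loop stops as soon as the comparator output changes state.

```python
# Illustrative sketch of controller-driven output level calibration.
# SimulatedDevice, its linear register model and all numeric values are
# hypothetical; only the increment/decrement-and-compare idea is from the text.


class SimulatedDevice:
    """Stand-in for a memory device with an output level register."""

    def __init__(self, code=0, high_mv=1800, step_mv=50):
        self.code = code
        self.high_mv = high_mv
        self.step_mv = step_mv

    def write_level_register(self, delta):
        """Increment (+1) or decrement (-1) the drive level register."""
        self.code = max(0, self.code + delta)

    def low_level_mv(self):
        """Low level this device produces on the bus, in millivolts."""
        return self.high_mv - self.code * self.step_mv


def calibrate_low_level(device, v_ref_mv, max_steps=64):
    """Step the register until the controller's comparator changes state.

    The comparator reports only whether the bus level is above the local
    reference. Returns the final register code, or None if the comparator
    never changes within max_steps.
    """
    above = device.low_level_mv() > v_ref_mv
    for _ in range(max_steps):
        device.write_level_register(+1 if above else -1)
        if (device.low_level_mv() > v_ref_mv) != above:
            return device.code
    return None
```

In this model a device starting at code 0 settles at code 16 (1000 mV) against a 1000 mV reference, and a device starting too low settles within one register step of the reference. Because the reference is local to the controller, every device on the bus is trimmed against the same level.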
A further aspect of the invention provides a memory subsystem wherein a repeater appears as a single load on the main bus but drives a set of signals identically to the controller to create a sub-bus on which memory devices can be connected. The repeater acts as a controller on this sub-bus and memory devices cannot distinguish between the main bus and the sub-bus and therefore, operate identically on either one. The increased latency of devices on the sub-bus compared to those connected directly to the main bus may be corrected by the controller by scheduling activity appropriately.
A still further aspect of the invention provides a memory subsystem, wherein the semiconductor devices include series stub resistors and the main bus is routed through the device to mitigate the effects of the stubs. Furthermore, conventional TSOP type packaging is used for lower cost.
In accordance with a further aspect of this invention, there is provided a memory subsystem comprising at least two semiconductor devices; a main bus containing a plurality of bus lines for carrying substantially all address, data and control information needed by the devices, the semiconductor devices including at least one memory device connected in parallel to the bus; where read and write data are accompanied by echo clocks, and burst placement is performed via vernier adjustment under control of the controller.