This invention relates generally to high speed data transfer, and more particularly to computer systems and methods for high speed data transfer.
Contemporary high performance computing main memory systems are generally composed of one or more dynamic random access memory (DRAM) devices, which are connected to one or more processors via one or more memory control elements. Overall computer system performance is affected by each of the key elements of the computer structure, including the performance/structure of the processor(s), any memory cache(s), the input/output (I/O) subsystem(s), the efficiency of the memory control function(s), the main memory device(s), and the type and structure of the memory interconnect interface(s).
Extensive research and development efforts are invested by the industry, on an ongoing basis, to create improved and/or innovative solutions that maximize overall system performance and density by improving the memory system/subsystem design and/or structure. High-availability systems present further challenges as related to overall system reliability due to customer expectations that new computer systems will markedly surpass existing systems in regard to mean-time-between-failure (MTBF), in addition to offering additional functions, increased performance, increased storage, lower operating costs, etc. Other frequent customer requirements further exacerbate the memory system design challenges, and include such items as ease of upgrade and reduced system environmental impact (such as space, power and cooling).
FIG. 1 relates to U.S. Pat. No. 5,513,135 to Dell et al., of common assignment herewith, which is hereby incorporated by reference in its entirety, and depicts an early synchronous memory module. The memory module depicted in FIG. 1 is a dual in-line memory module (DIMM). This module is composed of synchronous DRAMs 8, buffer devices 12, an optimized pinout, and an interconnect and capacitive decoupling method to facilitate high performance operation. The patent also describes the use of clock re-drive on the module, using such devices as phase-locked loops (PLLs).
Digital PLLs (DPLLs) are described, for example, in U.S. Pat. No. 6,115,439 to Andresen et al. (hereinafter “Andresen”), which is hereby incorporated by reference in its entirety. Andresen teaches a DPLL for receiving a reference clock signal and for generating a clock signal having a predetermined number of clock cycles during each cycle of the reference clock. The DPLL in Andresen includes a variable delay circuit for varying the length of each of the generated clock cycles so that they are in lock with the reference clock cycle. Compare circuitry in the clock multiplier determines whether the end of the generated clock cycles are within one of several thresholds relative to the end of the reference clock cycle. The length of the generated clock cycles is varied by an incremental delay based on the output of the compare circuitry. Andresen further teaches that the clock multiplier is free running once lock is obtained between the reference clock and the generated clock.
FIG. 2 illustrates a simplified block diagram of a clock multiplier (DPLL) 240 taught by Andresen. Control block 242 receives an input clock, CKref. Responsive to CKref, control block 242 outputs a clock signal, CKa, to variable delay circuit 244, along with delay control signals. Variable delay circuit 244 outputs clock signal CKar to control block 242. Control block 242 outputs a clock, CKout, and a lock signal indicating whether CKout is locked on the input signal CKref. As shown in FIG. 2, the clock multiplier 240 uses a single variable delay stage 244, which reduces the power consumed by the clock multiplier 240.
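The lock-seeking behavior described above (adjust the generated cycle length by an incremental delay until its end falls within a threshold of the reference cycle's end) can be sketched as a simple behavioral model. This is a hypothetical simulation for illustration only, not Andresen's circuit; the period, step size, and threshold values are assumptions.

```python
# Behavioral sketch (assumed values) of a clock-multiplier lock loop:
# the generated cycle length is nudged by a fixed incremental delay
# until the multiplied cycles span one reference cycle within threshold.

REF_PERIOD_PS = 10_000      # assumed reference clock period
MULTIPLY_BY = 4             # generated cycles per reference cycle
STEP_PS = 50                # incremental delay adjustment per comparison
LOCK_THRESHOLD_PS = 25      # "close enough" window around the target

def lock_generated_clock(initial_cycle_ps: int, max_iters: int = 1000):
    """Adjust the generated cycle length until MULTIPLY_BY generated
    cycles span one reference cycle to within LOCK_THRESHOLD_PS."""
    cycle = initial_cycle_ps
    for i in range(max_iters):
        error = MULTIPLY_BY * cycle - REF_PERIOD_PS   # end-of-cycle offset
        if abs(error) <= LOCK_THRESHOLD_PS:
            return cycle, i, True                     # lock achieved
        cycle += -STEP_PS if error > 0 else STEP_PS   # up/down adjustment
    return cycle, max_iters, False

cycle, iters, locked = lock_generated_clock(initial_cycle_ps=3_000)
# converges to a 2,500 ps generated cycle (4 x 2,500 = 10,000 ps)
```

Once lock is reported, a free-running multiplier would simply hold the settled delay setting, which is the behavior Andresen describes.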
FIG. 3 illustrates a block diagram of the variable delay circuit 244 taught by Andresen. Clock CKa is received into a delay string 346 comprising fifteen delay elements 348 connected in series, where individual delay elements are referenced as delay elements 348a-348o. As would be clear to one skilled in the art, more or fewer delay elements could be used in the delay string 346. Switch 350 couples leads A and B across one of the delay elements 348. Switch 350 is controlled by the delay string controller (DSR) section of control block 242. Lead A couples the input side of the selected delay element 348 to a variable delay path 352 and lead B couples the output side of the selected delay element 348 to path 354. Variable delay path 352 comprises a buffer 356 and a plurality of capacitors 358 (fifteen capacitors 358 are shown in the embodiment of FIG. 3), individually referenced as capacitors 358a-358o. Capacitors 358 are selectively coupled in parallel between the output of buffer 356 and ground under control of the DSR section of control block 242. Path 354 comprises a buffer 360 and a plurality of capacitors 361 (individually referenced as capacitors 361a-361o) matching buffer 356 and capacitors 358. In this path the DSR section of the control block 242 uncouples all capacitive loads. An additional capacitor 362 is selectively coupled between the output of buffer 356 and ground and between the output of buffer 360 and ground under control of the epsilon controller (EC) section of control block 242. Paths 352 and 354 are coupled at commutator 364. The output of commutator 364 is clock CKar. Commutator 364 passes the edge of clock CKa which arrives first (either the edge that arrives through path 352 or the edge that arrives through path 354). In operation, delay elements 348 provide coarse adjustment of the delay through variable delay circuit 244.
When switch 350 couples leads A and B across delay element 348a, the smallest delay through the delay string 346 is achieved, as CKa propagates through a single delay element 348. On the other hand, when switch 350 couples leads A and B across delay element 348o, the largest delay is achieved, since CKa must propagate through all the delay elements 348 in the delay string 346. The capacitors 358 (and capacitors 361) provide a finer resolution of delay through variable delay circuit 244. After propagating through the delay element(s) 348, the CKa signal propagates through the delay path 352. The capacitors may be individually coupled to ground under control of control block 242. Each capacitor 358 enabled by control block 242 slightly increases the delay of the propagation through the delay path 352. Each enabled capacitor 358 (or capacitor 361) adds a maximum of approximately 50 psec of additional delay through the path. The commutator 364 passes the first CKa clock edge, either from path 352 or path 354, in order to prevent discontinuities in the incremental delay provided by the variable delay circuit 244.
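The coarse/fine relationship described above can be modeled numerically: the number of series delay elements sets the coarse delay, and each enabled load capacitor adds roughly 50 ps of fine delay. The per-element delay value below is an assumption for illustration; only the ~50 ps capacitor figure comes from the text.

```python
# Illustrative model (not the patented circuit) of the coarse/fine delay
# scheme: delay elements give coarse steps, load capacitors fine steps.

ELEMENT_DELAY_PS = 750   # assumed per-element coarse delay (hypothetical)
CAP_DELAY_PS = 50        # ~50 psec per enabled capacitor (from the text)
NUM_ELEMENTS = 15
NUM_CAPS = 15

def path_delay(elements_used: int, caps_enabled: int) -> int:
    """Total delay through the variable delay path, in picoseconds."""
    assert 1 <= elements_used <= NUM_ELEMENTS
    assert 0 <= caps_enabled <= NUM_CAPS
    return elements_used * ELEMENT_DELAY_PS + caps_enabled * CAP_DELAY_PS

# Smallest setting: one element, no capacitors enabled.
print(path_delay(1, 0))    # 750
# Largest setting: all fifteen elements plus all fifteen capacitors.
print(path_delay(15, 15))  # 12000
```

The point of the two scales is range versus resolution: the coarse string covers a wide delay span, while the capacitors interpolate between coarse steps.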
FIG. 4 illustrates a block diagram of the control block 242 as taught by Andresen. CKref is input to SRPD 470, which is a phase detector and multivibrator control circuit. The SRPD 470 outputs the CKa signal to the variable delay circuit 244 and to prescaler 472. SRPD 470 further outputs control signals to the DSR 474, a commutator controller (IC) 476 and the EC 478. The DSR 474, IC 476 and EC 478 output delay control signals to the variable delay circuit 244. In operation, the SRPD 470 receives the reference clock, CKref, and generates signals indicative of whether CKout is locked with CKref and whether it leads or lags CKref. The lock and up/down signals are passed to the DSR 474, IC 476 and EC 478, which adjust the delay through the delay string 346 and delay path 352 accordingly.
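The division of labor between the coarse (DSR) and fine (EC) controllers described above amounts to letting the coarse stage absorb large phase errors and the fine stage the residue. The sketch below is a simplification for illustration, not the patented controller; the coarse step value is assumed, and the fine step follows the ~50 ps figure from the text.

```python
# Hedged sketch of the coarse-then-fine delay split: choose how many
# delay elements and how many load capacitors approximate a target delay.

COARSE_STEP_PS = 750   # assumed coarse step (one delay element)
FINE_STEP_PS = 50      # fine step (~one load capacitor, per the text)

def settle(target_delay_ps: int):
    """Pick coarse/fine settings approximating target_delay_ps."""
    coarse = target_delay_ps // COARSE_STEP_PS        # DSR-like choice
    residue = target_delay_ps - coarse * COARSE_STEP_PS
    fine = round(residue / FINE_STEP_PS)              # EC-like trim
    return coarse, fine

coarse, fine = settle(2_380)
# 3 coarse elements (2,250 ps) + 3 fine capacitors (150 ps) ~= 2,400 ps
```

In the actual circuit the settings are reached incrementally via up/down signals rather than computed in one step, but the resulting coarse/fine partition is the same idea.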
FIG. 5 relates to U.S. Pat. No. 6,173,382 to Dell et al., of common assignment herewith, which is hereby incorporated by reference in its entirety, and depicts a computer system 510 which includes a synchronous memory module 520 that is directly (i.e. point-to-point) connected to a memory controller 514 via a bus 540, and which further includes logic circuitry 524 (such as an application specific integrated circuit, or “ASIC”) that buffers, registers or otherwise acts on the address, data and control information that is received from the memory controller 514. The memory module 520 can be programmed to operate in a plurality of selectable or programmable modes by way of an independent bus, such as an inter-integrated circuit (I2C) control bus 534, either as part of the memory initialization process or during normal operation. When utilized in applications requiring more than a single memory module connected directly to a memory controller, the patent notes that the resulting stubs can be minimized through the use of field-effect transistor (FET) switches to electrically disconnect modules from the bus.
Relative to U.S. Pat. No. 5,513,135, U.S. Pat. No. 6,173,382 further demonstrates the capability of integrating all of the defined functions (address, command, data, presence detect, etc.) into a single device. The integration of functions is a common industry practice that is enabled by technology improvements and, in this case, enables additional module density and/or functionality.
FIG. 6, from U.S. Pat. No. 6,510,100 to Grundon et al., of common assignment herewith, which is hereby incorporated by reference in its entirety, depicts a simplified diagram and description of a memory system 610 that includes up to four registered DIMMs 640 on a traditional multi-drop stub bus. The subsystem includes a memory controller 620, an external clock buffer 630, registered DIMMs 640, an address bus 650, a control bus 660 and a data bus 670 with terminators 695 on the address bus 650 and the data bus 670. Although only a single memory channel is shown in FIG. 6, systems produced with these modules often included more than one discrete memory channel from the memory controller, with each of the memory channels operated singly (when a single channel was populated with modules) or in parallel (when two or more channels were populated with modules) to achieve the desired system functionality and/or performance.
FIG. 7, from U.S. Pat. No. 6,587,912 to Bonella et al., which is hereby incorporated by reference in its entirety, depicts a synchronous memory module 710 and system structure in which repeater hubs 720 include local re-drive of the address, command and data to local memory devices 701 and 702 via buses 721 and 722; generation of a local clock (as described in other figures and the patent text); and the re-driving of the appropriate memory interface signals to the next module or component in the system via bus 700.
FIG. 8 depicts a contemporary system composed of an integrated processor chip 800, which contains one or more processor elements and an integrated memory controller 810. In the configuration depicted in FIG. 8, multiple independent cascade interconnected memory interface busses 806 are logically aggregated together to operate in unison to support a single independent access request at a higher bandwidth with data and error detection/correction information distributed or “striped” across the parallel busses and associated devices. The memory controller 810 attaches to four narrow/high speed point-to-point memory busses 806, with each bus 806 connecting one of the several unique memory controller interface channels to a cascade interconnect memory subsystem 803 (or memory module) which includes at least a hub device 804 and one or more memory devices 809. Some systems further enable operations when a subset of the memory busses 806 are populated with memory subsystems 803. In this case, the one or more populated memory busses 806 may operate in unison to support a single access request.
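The striping of a single access across several narrow busses operating in unison can be sketched as a round-robin distribution of bus beats across lanes. This is an illustrative model with assumed widths, not the specific format of the depicted system.

```python
# Illustrative sketch of "striping" one access across parallel narrow
# busses: data (and check) beats are distributed so every bus carries
# part of each transfer, multiplying the effective bandwidth.

NUM_BUSSES = 4   # matches the four point-to-point busses in the example

def stripe(beats):
    """Round-robin a sequence of bus beats across NUM_BUSSES lanes."""
    lanes = [[] for _ in range(NUM_BUSSES)]
    for i, beat in enumerate(beats):
        lanes[i % NUM_BUSSES].append(beat)
    return lanes

lanes = stripe(list(range(8)))
# lane 0 carries beats [0, 4], lane 1 [1, 5], lane 2 [2, 6], lane 3 [3, 7]
```

Because each access spans all lanes, every populated bus contributes to every transfer, which is why a subset of populated busses can still operate in unison at reduced width.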
It is desirable to be able to increase the bandwidth of DRAM device access in order to increase the speed with which data can be read and written to DRAM devices. One approach to increasing the bandwidth is taught by U.S. Pat. No. 6,378,020 to Farmwald et al. (hereinafter “Farmwald”), which is hereby incorporated by reference in its entirety. Farmwald teaches a memory subsystem that includes at least two semiconductor devices (including at least one memory device), connected in parallel to a bus, where the bus includes a plurality of bus lines for carrying substantially all address, data and control information needed by the memory devices. The control information includes device-select information and the bus has substantially fewer bus lines than the number of bits in a single address, and the bus carries device-select information without the need for separate device-select lines connected directly to individual devices. Farmwald teaches that the DRAMs and other devices receive address and control information over the bus and transmit or receive requested data over the same bus. Each memory device contains only a single bus interface with no other signal pins. Other devices that may be included in the system can connect to the bus and other non-bus lines, such as input/output lines. The bus supports large data block transfers and split transactions to allow a user to achieve high bus utilization. Farmwald teaches that high bus bandwidth is achieved by running the bus at a very high clock rate (hundreds of MHz).
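The multiplexed-bus idea described above, where a bus with fewer lines than the address width carries address, control and device-select information, implies serializing each request across several bus cycles. The sketch below uses an assumed framing (field widths, beat order) purely for illustration; it is not Farmwald's actual protocol.

```python
# Illustrative sketch (assumed framing, not Farmwald's protocol) of
# sending device-select, control and address information over a bus
# narrower than a single address, serialized across several bus cycles.

BUS_WIDTH_BITS = 8      # assumed bus width, fewer lines than the address
ADDR_BITS = 36          # assumed address width (wider than the bus)
ID_BITS, OP_BITS = 8, 4 # assumed device-select and opcode field widths

def serialize_request(device_id: int, opcode: int, address: int):
    """Pack a request into BUS_WIDTH_BITS-wide beats, MSB first."""
    word = (device_id << (ADDR_BITS + OP_BITS)) | (opcode << ADDR_BITS) | address
    total_bits = ID_BITS + OP_BITS + ADDR_BITS   # 48 bits -> 6 beats
    beats = []
    for shift in range(total_bits - BUS_WIDTH_BITS, -1, -BUS_WIDTH_BITS):
        beats.append((word >> shift) & ((1 << BUS_WIDTH_BITS) - 1))
    return beats

beats = serialize_request(device_id=0xA5, opcode=0x3, address=0x123456789)
# six 8-bit beats: [0xA5, 0x31, 0x23, 0x45, 0x67, 0x89]
```

Because the device-select field travels in-band with the request, no dedicated select line per device is needed, which is the property the paragraph above highlights.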
Farmwald further teaches that an important function of its input/output circuitry is to generate an internal device clock based on early and late external bus clocks. Farmwald teaches that controlling clock skew (the difference in clock timing between devices) is important in a system running with 2 nanosecond cycles, thus the internal device clock is generated so that the input sampler (for capturing data) and the output driver operate as close in time as possible to midway between bus clocks. Thus, the data is sampled based on both the rising edge and the falling edge of a single clock period of one or more of the external bus clocks.
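The dual-edge sampling described above yields two data phases per clock period, but it makes the sample spacing depend on the clock's duty cycle. The sketch below illustrates this with assumed, idealized timing values; it is not a model of Farmwald's circuitry.

```python
# Minimal sketch of double-edge sampling: one sample per clock edge,
# hence two data phases per period. Timing values are illustrative.

CLOCK_PERIOD_NS = 2.0           # ~2 ns cycle, per the text

def sample_points(num_periods: int, duty_cycle: float = 0.5):
    """Return sample times: one per edge, two per clock period.
    A non-ideal duty cycle skews the falling-edge samples."""
    times = []
    for n in range(num_periods):
        start = n * CLOCK_PERIOD_NS
        times.append(start)                                  # rising edge
        times.append(start + duty_cycle * CLOCK_PERIOD_NS)   # falling edge
    return times

print(sample_points(2))          # evenly spaced: [0.0, 1.0, 2.0, 3.0]
print(sample_points(2, 0.45))    # skewed falling-edge samples
```

With a 45% duty cycle the falling-edge samples land early, shortening one data phase and lengthening the other, which is the symmetry problem discussed next.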
A drawback of the approach taught by Farmwald, where data is sampled based on both edges of a single clock period, is that the symmetry of the clock period has to be taken into account. Typically such clocking is avoided for high-speed applications because electrical characteristics (capacitance, inductance, etc.) of the environment have a different effect on the rising edge of a clock signal than on the falling edge. Furthermore, the switching level of a receiver receiving a rising clock edge may be very different from that of the receiver receiving the falling clock edge. In addition, such a clock mechanism requires that the positive phase and negative phase of the clock period be identical. Additionally, other anomalies such as jitter and drift may adversely affect such a clocking scheme.
It would be desirable to have a memory subsystem that avoids the above drawbacks while providing a high speed bus for transferring multiple data phases in a single clock period.