A computer network is a geographically distributed collection of interconnected communication links for transporting data between nodes, such as computers. Many types of computer networks are available, with the types ranging from local area networks (LANs) to wide area networks (WANs). A plurality of LANs may be further interconnected by an intermediate network node, such as a router or switch, to form an inter-network of nodes that extends the effective “size” of the computer network by increasing the number of communicating nodes. The nodes typically communicate by exchanging discrete frames or packets of data according to predefined protocols. In this context, a protocol consists of a set of rules defining how the nodes interact with each other.
Each node typically comprises a number of basic subsystems including processor, memory and input/output (I/O) subsystems. Data is transferred between the memory, processor and I/O subsystems over a system bus, while data requests within the memory subsystem occur over a memory bus coupling a memory controller to one or more memory devices. Each bus typically consists of address, data and control lines, with the control lines carrying control signals specifying the direction and type of transfer. For example, the processor may issue a read request to the memory controller, requesting the transfer of data from an addressed location on a memory device coupled to the memory bus. The processor may then process the retrieved data in accordance with instructions and thereafter may issue a write request to the controller to store the processed data in, e.g., another addressed location in the memory subsystem.
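The read, process, and write sequence described above can be sketched as a toy model. This is a minimal illustration, assuming hypothetical `MemoryController` and `MemoryDevice` classes that map a flat address onto a list of simulated devices. It models only the request flow and ignores bus timing and the separate address, data and control lines.

```python
from __future__ import annotations


class MemoryDevice:
    """Hypothetical memory device: a fixed number of addressable words."""

    def __init__(self, size: int) -> None:
        self.cells = [0] * size


class MemoryController:
    """Hypothetical controller servicing read/write requests over a memory bus."""

    def __init__(self, devices: list[MemoryDevice]) -> None:
        self.devices = devices
        self.words_per_device = len(devices[0].cells)

    def _locate(self, addr: int) -> tuple[MemoryDevice, int]:
        # Split a flat address into (device, offset within that device).
        index, offset = divmod(addr, self.words_per_device)
        return self.devices[index], offset

    def read(self, addr: int) -> int:
        device, offset = self._locate(addr)
        return device.cells[offset]

    def write(self, addr: int, value: int) -> None:
        device, offset = self._locate(addr)
        device.cells[offset] = value


# Processor-side flow: read, process, then write the result elsewhere.
ctrl = MemoryController([MemoryDevice(16) for _ in range(4)])
ctrl.write(3, 20)
result = ctrl.read(3) * 2 + 1   # "process" the retrieved data
ctrl.write(40, result)          # store it at another addressed location
print(ctrl.read(40))            # 41
```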
One of the control signals transmitted between the memory controller and memory devices is a clock signal used to control the timing of data transfer operations. The clock signal synchronizes the transmission and reception of data between physically separated points on the memory bus. The memory controller generates both local clock signals, which control logic on the controller within a local clock domain, and remote clock signals, which control the memory devices within a remote clock domain when data is transmitted to and from the memory controller. The local and remote clock signals are generated from the same clock source, e.g., a phase-locked loop (PLL) circuit on the memory controller, to thereby produce local and remote clock frequencies that are substantially identical.
For proper operation of the memory subsystem, clock signals should arrive at the bus interface circuitry at the same time; otherwise, reliable data transmission cannot be ensured. For example, if a bus interface circuit receiving data is “clocked” later than others, the earlier-clocked bus interface circuits may overwrite the data before it is stored at its proper destination. This lack of simultaneity in the reception of the clock signals, i.e., clock skew, directly increases the amount of time that the data must remain stable on the memory bus to ensure reliable data transmission; this, in turn, increases the time required for each data transfer on the bus and thus reduces the speed and performance of the memory subsystem.
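The cost of skew can be illustrated with a simple timing budget. This sketch assumes, for illustration only, that the minimum time data must stay valid on the bus equals the receiver's setup time plus its hold time plus the worst-case clock skew. The setup and hold figures below are hypothetical and are not taken from any particular device.

```python
def min_valid_window_ps(setup_ps: int, hold_ps: int, skew_ps: int) -> int:
    """Minimum time (ps) data must remain stable on the bus.

    Assumed model: the receiver needs setup + hold around its clock edge,
    and skew widens that window because receivers sample at different times.
    """
    return setup_ps + hold_ps + skew_ps


def max_transfer_rate_mts(window_ps: int) -> float:
    """Upper bound on transfers per second, in MT/s, if each transfer needs window_ps."""
    # 1 / (window_ps * 1e-12 s) transfers/s, divided by 1e6 for MT/s.
    return 1e6 / window_ps


# Hypothetical setup/hold values; skew is varied to show its direct cost.
for skew in (0, 250, 500):
    w = min_valid_window_ps(setup_ps=200, hold_ps=200, skew_ps=skew)
    print(f"skew={skew} ps -> window={w} ps, <= {max_transfer_rate_mts(w):.0f} MT/s")
```

Every picosecond of skew is added directly to the per-transfer window, which lowers the achievable transfer rate.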
The performance of the memory subsystem may be increased by increasing the number of high-speed memory devices in the subsystem, along with increasing the speed and width of the memory bus coupled to those devices. In this context, high speed denotes the transfer of a “piece” of data every nanosecond (nsec) or, generally, at gigahertz (GHz) data rates. FIG. 1 is a schematic block diagram of a typical high performance memory subsystem 100 comprising a memory controller 110 coupled to a plurality of high speed memory devices 130 over a memory bus 120. The interaction of the memory controller and memory devices is depicted in a linear fashion, illustrating the transfer of a request (e.g., a read or write request) from the controller to the memory devices and then, in the case of a read request, the return of requested data from the memory devices to the memory controller.
Assume the memory subsystem 100 includes, e.g., eight commodity memory devices 130, wherein each memory device is 32 data bits wide. Therefore, the eight memory devices of the memory subsystem collectively form a 256-bit data portion 126 of the memory bus 120. Moreover, the 32 data bits of each memory device are organized into four data groupings 125, wherein each data grouping is eight bits wide and has its own reference clock signal. For a typical commodity memory device, there can be as much as ±1100 picoseconds of skew between data groupings on the same device.
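The bus organization above can be checked arithmetically. This sketch assumes, hypothetically, that data-bus bits are numbered contiguously device by device; the real bit-to-device assignment depends on board wiring. It shows that the 256-bit word is captured using 32 independently clocked groupings, each of which can be skewed relative to the others.

```python
from __future__ import annotations

DEVICES = 8
BITS_PER_DEVICE = 32
BITS_PER_GROUP = 8  # each grouping has its own reference clock

groups_per_device = BITS_PER_DEVICE // BITS_PER_GROUP   # 4
bus_width = DEVICES * BITS_PER_DEVICE                   # 256
total_groups = DEVICES * groups_per_device              # 32 independently clocked groups


def group_of_bit(bit: int) -> tuple[int, int]:
    """Map a data-bus bit (0..255) to (device index, grouping index within device).

    Assumes a hypothetical contiguous bit numbering across devices.
    """
    device, bit_in_device = divmod(bit, BITS_PER_DEVICE)
    return device, bit_in_device // BITS_PER_GROUP


print(bus_width, groups_per_device, total_groups)            # 256 4 32
print(group_of_bit(0), group_of_bit(37), group_of_bit(255))  # (0, 0) (1, 0) (7, 3)
```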
When issuing a request, the memory controller 110 transmits a source clock (clk) signal 122 along with address information 124 (and write data, if necessary) over extended board traces 152 on a printed circuit board (PCB) 150 to the high-speed memory devices 130. The memory devices typically return a reference “echo” clock signal 128 along with any requested read data over the data bus portion 126 to the memory controller 110. The reference echo clock signal 128 is a data output strobe (DQS), which may be a single-bit signal or, more typically for high-speed memory devices, a differential signal. The DQS clock signal 128 and the data bus (DQ) bits 126 are also transmitted over extended board traces 152 of the PCB.
When laying out the PCB 150, both the reference clock and data bus board traces 152 are routed to very precise lengths. However, there is still some degree of error in the routing because the signals carried over these traces may be routed through different layers of the PCB having different impedance characteristics that translate into timing differences. In general, process, voltage and temperature (PVT) differences, along with different dielectric constants among the various layers of the PCB 150, introduce substantial delays or skew into the memory subsystem 100. Moreover, the memory devices 130 in the memory subsystem may have different timing characteristics that introduce skew into the subsystem.
In general, substantial delay or skew is introduced in many areas of the memory subsystem 100. For example, at the output of the memory controller there may be delay between the source clock and the data/address bus signals, hereinafter referred to as Δt1. The memory controller 110 may be embodied as an application specific integrated circuit (ASIC), and a wide 256-bit data bus interface circuit 112 on the controller ASIC can have as much as 0.5 nsecs of skew. Here, the 256 bits of data are spread over a large area of the die, and the reluctance of certain ASIC vendors to manually place and tune individual bits during placement may result in such skew.
In addition, board trace delays between the memory controller 110 and memory devices 130 may introduce skew, hereinafter denoted as Δt2. For example, the source clock signals 122 issued by the memory controller to the individual memory devices can have approximately 0.5 nsecs of skew. There are also delays/skew, denoted Δt3, between the various memory devices 130. In the case of reduced latency dynamic random access memory (RLDRAM) devices, the DQS clock signal 128 has a minimum-to-maximum delay range of 1.5 nsecs to 2.3 nsecs. The DQS reference clocks for the data groupings originating from the same memory device can have approximately 0.5 nsecs of skew. Each memory device further has its own unique microenvironment with different PVT characteristics, which can produce approximately 1 nsec of skew.
Skew also arises with respect to the board trace/layout and routing (along with crosstalk) for signals transmitted between the memory devices and the memory controller, hereinafter denoted Δt4. In this case, the delays associated with Δt4 can amount to another 0.5 nsecs of skew. As noted, logic in the memory controller 110 has finite delay and routing between the logic may not be identical, therefore translating into further skew, herein denoted Δt5. Here, the Δt5 skew arises between bus interface logic 112 on the memory controller used to capture data from the memory devices in the remote clock domain 170 and internal logic 114 on the controller 110 used to bring that data into the local clock domain 160 across a local clock boundary 165.
When the memory subsystem and the memory devices operate at high speed, every nsec of skew can amount to a significant portion of a clock cycle. For example, operating the memory subsystem at a 400 megahertz (MHz) clock frequency yields a 2.5 nsec clock period. The data period, however, is half the clock period, or 1.25 nsecs, because double data rate (DDR) capture transfers data on both edges of the clock. The skew in the memory subsystem may cause phase differences between the clock signals on the order of a couple of nsecs, which result in the signals being entirely asynchronous. Operation of the memory subsystem at such high data rates may, in turn, result in portions of the read data being spread over multiple clock boundaries. That is, data returned to the memory controller 110 from the memory devices 130 in response to a read request may not arrive at the controller at the same time, so when the memory controller attempts to capture the returned read data, all of that data may not be present at the same clock cycle boundary.
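The period arithmetic above can be checked with a few lines of Python. This is a small sketch; the 2-nsec skew value in the usage lines is an illustrative stand-in for the “couple of nsecs” mentioned in the text, not a measured figure.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz


def ddr_data_period_ns(freq_mhz: float) -> float:
    """DDR moves data on both clock edges, so each data slot is half a clock period."""
    return clock_period_ns(freq_mhz) / 2


def skew_in_data_periods(skew_ns: float, freq_mhz: float) -> float:
    """Express a phase offset as a fraction (or multiple) of the DDR data period."""
    return skew_ns / ddr_data_period_ns(freq_mhz)


print(clock_period_ns(400))            # 2.5
print(ddr_data_period_ns(400))         # 1.25
print(skew_in_data_periods(2.0, 400))  # 1.6
```

A ratio above 1.0 means the phase offset exceeds an entire data period, so bits of the same transfer can fall on different capture edges.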
For instance, the above-described skew budget, represented by the sum of Δt1 through Δt5, can result in approximately three nsecs of arrival time uncertainty (phase differences) for data transmitted between the memory controller 110 and the memory devices 130 and, more specifically, the data groupings 125 of the memory devices. Three nsecs of skew represent 2.4 periods of the 1.25-nsec data clock, resulting in phase misalignment at the memory controller 110. Capturing the data within an individual grouping at a clock boundary can be difficult, but manageable. Yet, such phase misalignment may cause portions of a 256-bit data “word” returned by the memory devices in response to a read request to arrive at the controller spread out over three different clock cycles, thus making it impossible to capture the entire data word across the 256-bit bus portion 126 at the same clock boundary. Phase alignment is crucial to capturing data at the memory controller, and the present invention is directed, in part, to ensuring such alignment at the memory controller.
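The capture problem can be modeled by counting how many data-capture cycles a skewed arrival window touches. This sketch assumes, for illustration, that capture edges fall at integer multiples of the data period and that the bits of one word arrive anywhere within a window whose width equals the total skew. The function name `capture_cycles_spanned` is hypothetical. Times are kept in integer picoseconds to avoid floating-point rounding.

```python
DATA_PERIOD_PS = 1250   # 1.25-nsec DDR data period at 400 MHz
SKEW_WINDOW_PS = 3000   # ~3 nsecs of arrival uncertainty (Δt1 through Δt5 combined)


def capture_cycles_spanned(arrival_start_ps: int, window_ps: int, data_period_ps: int) -> int:
    """Number of capture cycles touched by bits arriving in [start, start + window).

    Assumes capture edges at integer multiples of data_period_ps.
    """
    first = arrival_start_ps // data_period_ps
    last = (arrival_start_ps + window_ps - 1) // data_period_ps
    return last - first + 1


# Try every alignment of the window relative to the capture edges.
spans = [capture_cycles_spanned(s, SKEW_WINDOW_PS, DATA_PERIOD_PS) for s in range(DATA_PERIOD_PS)]
print(SKEW_WINDOW_PS / DATA_PERIOD_PS)  # 2.4
print(min(spans), max(spans))           # 3 4
```

Under this simple model, a 3-nsec window always touches at least three capture cycles and can touch four when it starts just before a capture edge. Any window wider than one data period guarantees that no single capture edge collects the whole 256-bit word.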