Modern data processing systems require the transfer of data between dynamic, or clocked, circuits embodied in multiple chips in the system. For example, data may need to be transferred between central processing units (CPUs) in a multi-CPU system, or between a CPU and the memory system, which may include a memory controller and off-chip cache. Data transfers are synchronous, and data is expected to be delivered to the circuitry on the chip on a predetermined system cycle. As CPU speeds have increased, the speed of the interface between chips (bus cycle time) has become the limiting constraint, because the latency across the interface exceeds the system clock period. To maintain system synchronization, the system designer must slow the speed of the bus so that the cycle on which data arrives is unambiguous.
This may be further understood by referring to FIG. 1A, in which is depicted, in block diagram form, a prior art interface 100 between two integrated circuit chips, chip 102 and chip 104, in a data processing system. Each of chips 102 and 104 receives a reference clock 106 coupled to a phase-locked loop, PLL 108. PLL 108 generates a local clock, clock 110 in chip 102 and clock 111 in chip 104, locked to reference clock 106. Reference clock 106 provides a "time zero" reference, and may be asserted for multiple periods of local clocks 110 and 111, depending on the multiplication of PLL 108. The bus clock 113 is derived from reference clock 106 by dividing local clock 110 by a predetermined integer, N, in divider 112. Data to be sent from chip 102 to chip 104 is latched in latch 114 on a predetermined edge of the divided local clock 110 and driven onto data line 116 via driver 118. Data is received at receiver (RX) 120 and captured into destination latch 122 on a predetermined edge of the divided local clock 111 in chip 104. Due to the physical separation of chip 102 and chip 104, the data appears at input 124 of destination latch 122 delayed in time. (The contribution of RX 120 to the latency is typically small relative to the delay due to the data transfer.) This time delay is referred to as the latency, and will be discussed further in conjunction with FIG. 1B.
Similarly, chip 104 sends data to chip 102 via data line 126. Data to be sent from chip 104 is latched in latch 128 on a predetermined edge of the output signal from divider 130, which divides local clock 111 by N. The data is driven onto data line 126 via driver 132 and received at receiver 136. The data input to chip 102 is captured into destination latch 134 on a predetermined edge of an output of divider 130, which also divides local clock 110 by N.
In FIG. 1B, there is illustrated an exemplary timing diagram for interface 100 of FIG. 1A, in accordance with the prior art. Data 115 sent from chip 102 to chip 104 is latched, in latch 114, on a rising edge, t.sub.1, of bus clock 113. Bus clock 113 is generated by dividing local clock 110 by N in dividers 112 and 130 in chip 102. Following a delay by the latency, T.sub.1, data 117 appears at an input to destination latch 122, and is latched on rising edge t.sub.2 of bus clock 123. Bus clock 123 is generated by dividing local clock 111 by N in dividers 112 and 130 in chip 104. Thus, in the prior art in accordance with FIG. 1B, data 125 appears in chip 104 one bus cycle following its launch from chip 102. In FIG. 1B, there is zero skew between bus clock 113 and bus clock 123.
If, in interface 100 in FIG. 1A, the bus clock speed is increased, the latency may exceed one bus clock cycle. Then the exemplary timing diagram illustrated in FIG. 1C may result. As before, data 115 has been latched on edge t.sub.1 of bus clock 113. Data 117 appears at input 124 of destination latch 122 after latency time, T.sub.1, which is longer than the period of bus clock 113 and bus clock 123. Data 117 is latched on edge t.sub.3 of bus clock 123 in chip 104 to provide data 125 on chip 104. If interface 100 between chips 102 and 104 represents the interface having the longest latency from among a plurality of interfaces between chip 102 and the plurality of other chips within a data processing system, then the two-cycle latency illustrated in FIG. 1C represents the "target" cycle for the transmission and capture of data between chips, such as chip 102 and chip 104. The target cycle is the predetermined cycle at which data is expected by the chip. Interfaces having a shorter latency may need to be padded, in accordance with the prior art, in order to ensure synchronous operation. The padding ensures that faster paths in interface 100 have latencies greater than one bus clock cycle and less than two bus clock cycles, whereby data synchronization may be maintained.
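The relationship between latency and capture cycle described for FIGS. 1B and 1C can be sketched as follows. This is a minimal illustration, not part of the described interface: it assumes zero skew between the launch and capture bus clocks, ignores setup and hold times, and treats data arriving exactly on an edge as captured on that edge. The function name `capture_cycle` and the picosecond values are hypothetical.

```python
def capture_cycle(latency_ps: int, bus_period_ps: int) -> int:
    """Return the index of the bus-clock rising edge that captures the data.

    Edge 0 is the launch edge (t1 in the timing diagrams). Data is captured
    on the first rising edge at or after its arrival at the destination latch.
    Assumes zero skew and ideal latches (an idealization for illustration).
    """
    if latency_ps <= 0 or bus_period_ps <= 0:
        raise ValueError("latency and bus period must be positive")
    # Ceiling division on integers avoids floating-point rounding issues.
    return -(-latency_ps // bus_period_ps)


# Hypothetical values: a 2000 ps bus period.
# Latency shorter than one bus cycle (the FIG. 1B situation).
one_cycle_case = capture_cycle(1500, 2000)
# Latency between one and two bus cycles (the FIG. 1C situation).
two_cycle_case = capture_cycle(3000, 2000)
print(one_cycle_case, two_cycle_case)
```

Under these assumptions, any latency greater than one bus cycle and no greater than two bus cycles maps to the same capture cycle, which is why paths padded into that range remain synchronized with the slowest path.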
This may be further understood by referring now to FIG. 1D, illustrating a plurality 101 of chips, chips 102-105. Chip 102 and chip 104 are coupled on "slow" path 152 having a long latency, T.sub.S. Chip 103 is coupled to chip 102 via "fast" path 154 having a short latency, T.sub.F. A "nominal" path among the plurality 101 of chips 102-105, such as path 156 between chip 102 and chip 105, has latency T.sub.M.
The timing diagram in FIG. 1E provides further detail. FIG. 1E illustrates a timing diagram similar to that in FIG. 1C in which the target cycle for the capture of data into a receiving chip is two bus cycles. In FIG. 1E, the nominal latency, T.sub.M, is shown to be 1.5 bus cycles, the fast path latency, T.sub.F, is illustrated to be just greater than one bus cycle, and the slow path latency, T.sub.S, is shown to be slightly less than two bus cycles. In this case, each of the plurality of chips 101 in FIG. 1D captures data on the target cycle, two bus cycles after data launch.
If, however, the fast path is shorter, as illustrated by fast path latency T'.sub.F, data synchronization is lost. In this case, data arrives at chip 103 prior to transition t.sub.2 of the chip 103 bus clock, as illustrated by the dotted portion of data 117 at chip 103, and is latched into chip 103 after one bus cycle. This is illustrated by the dotted portion of data 125 in chip 103. In order to restore synchronization, the fast path, path 154, between chips 102 and 103 would require padding to increase the fast path latency from T'.sub.F to T.sub.F. Consequently, the timing of such a prior art interface is tuned to a specific operating range and a particular interface length, and is valid only for the technology for which the design was timed and analyzed.
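The padding requirement can be made concrete with a short calculation. The sketch below is illustrative only: the function name `padding_needed_ps`, the one-picosecond margin, and all latency values are assumptions chosen to resemble the proportions described for FIG. 1E, not figures taken from it. It also assumes zero skew and that data arriving exactly on an edge is captured on that edge.

```python
def padding_needed_ps(latency_ps: int, bus_period_ps: int,
                      target_cycle: int, margin_ps: int = 1) -> int:
    """Return the delay to add so that data is captured on target_cycle.

    Data is captured on the target cycle when its latency falls after the
    edge preceding the target edge and no later than the target edge itself.
    margin_ps is a hypothetical safety margin past the preceding edge.
    """
    window_open = (target_cycle - 1) * bus_period_ps  # edge before the target
    window_close = target_cycle * bus_period_ps       # the target edge
    if latency_ps > window_close:
        # Padding only adds delay, so it cannot help a path that is too slow.
        raise ValueError("path is slower than the target cycle")
    return max(0, window_open + margin_ps - latency_ps)


# Hypothetical values: 2000 ps bus period, target cycle of two bus cycles.
BUS_PERIOD_PS = 2000
paths = {
    "T_M (nominal)": 3000,    # 1.5 bus cycles
    "T_F (fast)": 2100,       # just over one bus cycle
    "T_S (slow)": 3900,       # slightly under two bus cycles
    "T'_F (too fast)": 1800,  # under one bus cycle, so captured a cycle early
}
for name, latency in paths.items():
    print(name, padding_needed_ps(latency, BUS_PERIOD_PS, target_cycle=2))
```

Only the path with latency below one bus cycle needs padding here. The amount depends on the bus period and on the physical latency of that particular path, which illustrates why the padding is specific to one design and one operating point.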
Likewise, increasing the clock speed of the chips in FIG. 1D will result in a loss of synchronization. This may be understood by considering an explicit example. The local clock is first taken to have a 1 nanosecond (ns) period. The bus clock will have a period that is a fixed multiple, here taken to be two, of the local clock period, i.e., 2 ns. Let the nominal latency of the interface, T.sub.M, be 3 ns with ±0.99 ns of timing variation, i.e., the best case, or fast path, T.sub.F, is approximately 2 ns and the worst case, or slow path, T.sub.S, is approximately 4 ns. The data will arrive after 2 ns and before 4 ns. Hence the interface will operate under all conditions, i.e., data is guaranteed to arrive after the first bus cycle and before the second bus cycle. However, if the speed of the chips is increased to a 0.9 ns cycle time, the bus cycle time becomes 1.8 ns. In order to ensure enough time for the data to propagate across the interface under worst case conditions, the data must not be captured before 2.5 bus cycles, or 4.5 ns, because two bus cycles, or 3.6 ns, is less than the slow path time, T.sub.S, of approximately 4 ns. Then, in order to operate with a 1.8 ns bus cycle, the earliest that data may arrive is 1.5*1.8 ns=2.7 ns (one bus cycle before the capture point), to ensure data arrives on the same cycle for all conditions. However, from the above latency numbers, the earliest that data can arrive is via the fast path, with a T.sub.F of 3 ns-0.99 ns=2.01 ns. Thus, operating at a bus cycle time of 1.8 ns cannot be supported in a conventional synchronous design. In order to operate synchronously, the bus-to-processor ratio must be slowed to at least 3:1, giving a 2.7 ns bus cycle time (2.7 ns*1.5 cycles=4.05 ns and 2.7 ns*0.5 cycles=1.35 ns), which militates against the increase in local clock speed.
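The arithmetic of this example can be checked with a short script. This is a sketch under stated assumptions rather than a model of any particular design: it assumes capture points fall on multiples of half a bus cycle (inferred from the 2.5, 1.5 and 0.5 cycle figures used above), assumes zero skew, and uses the latency bounds 3 ns ± 0.99 ns expressed in picoseconds. The function name `sync_window` is hypothetical.

```python
def sync_window(fast_ps: int, slow_ps: int, bus_period_ps: int):
    """Return (capture_ps, earliest_ok_ps, feasible) for one bus period.

    capture_ps is the first half-cycle multiple at or after the slow-path
    arrival. Data must arrive no earlier than one bus cycle before that
    point, or it would be captured one cycle early.
    """
    if bus_period_ps % 2:
        raise ValueError("bus period must be an even number of picoseconds")
    half_ps = bus_period_ps // 2
    # Smallest number of half cycles that covers the slow path (ceiling).
    n_half = -(-slow_ps // half_ps)
    capture_ps = n_half * half_ps
    earliest_ok_ps = capture_ps - bus_period_ps
    return capture_ps, earliest_ok_ps, fast_ps >= earliest_ok_ps


FAST_PS = 3000 - 990  # T_F = 2.01 ns
SLOW_PS = 3000 + 990  # T_S = 3.99 ns

# (local clock period in ps, bus-to-processor ratio)
for local_ps, ratio in [(1000, 2), (900, 2), (900, 3)]:
    bus_ps = local_ps * ratio
    capture, earliest, ok = sync_window(FAST_PS, SLOW_PS, bus_ps)
    print(f"bus {bus_ps} ps: capture at {capture} ps, "
          f"earliest arrival {earliest} ps, synchronous: {ok}")
```

With these inputs the 2 ns and 2.7 ns bus cycles admit a capture window containing both the fast and slow paths, while the 1.8 ns bus cycle does not, matching the conclusion of the example.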
Thus, there is a need in the art for apparatus and methods to accommodate data transfers between chips in a data processing system having increasing clock speeds. In particular, there is a need for methods and apparatus to ensure data synchronization between chips in data processing systems in which path latencies vary over more than one bus cycle, and in which the need for design-specific hardware padding is eliminated.