The present invention relates in general to data processing systems, and in particular, to the interface between dynamic, or clocked, integrated circuit chips in a data processing system.
Modern data processing systems require the transfer of data between dynamic, or clocked, circuits embodied in multiple chips in the system. For example, data may need to be transferred between central processing units (CPUs) in a multi-CPU system, or between a CPU and the memory system which may include a memory controller and off-chip cache. Data transfers are synchronous, and data is expected to be delivered to the circuitry on the chip on a predetermined system cycle. As CPU speeds have increased, the speed of the interface between chips (bus cycle time) has become the limiting constraint as the latency across the interface exceeds the system clock period. In order to maintain system synchronization, the system designer must slow the speed of the bus in order that the cycle on which data arrives be unambiguous.
This may be further understood by referring to FIG. 1A, in which is depicted, in block diagram form, a prior art interface between two integrated circuit chips, chip 102 and chip 104, in a data processing system. Each of chips 102 and 104 receives a reference clock 106 coupled to a phase lock loop, PLL 108. PLL 108 generates a local clock, clock 110 in chip 102 and clock 111 in chip 104, locked to reference clock 106. Reference clock 106 provides a “time zero” reference, and may be asserted for multiple periods of local clocks 110 and 111, depending on the multiplication of PLL 108. The bus clock 113 is derived from reference clock 106 by dividing local clock 110 by a predetermined integer, N, in divider 112. Data to be sent from chip 102 to chip 104 is latched on a predetermined edge of the divided local clock 110 and driven onto data line 116 via driver 118. Data is received at receiver (RX) 120 and captured into destination latch 122 on a predetermined edge of the divided local clock 111 in chip 104. Due to the physical separation of chip 102 and chip 104, the data appears at input 124 of destination latch 122 delayed in time. (The contribution of RX 120 to the latency is typically small relative to the delay due to the data transfer.) The time delay is referred to as the latency, and will be discussed further in conjunction with FIG. 1B.
Similarly, chip 104 sends data to chip 102 via data line 126. Data to be sent from chip 104 is latched in latch 128 on a predetermined edge of the output signal from divider 130 which divides local clock 111 by N. The data is driven onto data line 126 via driver 132 and captured on destination latch 134 via receiver 136. The data input to chip 102 is captured into data latch 134 on a predetermined edge of an output of divider 130 which also divides local clock 110 by N.
In FIG. 1B, there is illustrated an exemplary timing diagram for interface 100 of FIG. 1A, in accordance with the prior art. Data 115 sent from chip 102 to chip 104 is latched, in latch 114, on a rising edge, t1, of bus clock 113. Bus clock 113 is generated by dividing local clock 110 by N in dividers 112 and 130 in chip 102. Following a delay by the latency, T1, data 117 appears at an input to destination latch 122, and is latched on rising edge t2 of bus clock 123. Bus clock 123 is generated by dividing local clock 111 by N in dividers 112 and 130 in chip 104. Thus, in the prior art in accordance with FIG. 1B, data 125 appears in chip 104 one bus cycle following its launch from chip 102. In FIG. 1B, there is zero skew between bus clock 113 and bus clock 123.
If, in interface 100 in FIG. 1A, the bus clock speed is increased, the latency may exceed one bus clock cycle. Then the exemplary timing diagram illustrated in FIG. 1C may result. As before, data 115 has been latched on edge t1 of bus clock 113. Data 117 appears at input 124 of destination latch 122 after latency time, T1, which is longer than the period of bus clock 113 and bus clock 123. Data 117 is latched on edge t3 of bus clock 123 in chip 104 to provide data 125 on chip 104. If interface 100 between chips 102 and 104 represents the interface having the longest latency from among a plurality of interfaces between chip 102 and the plurality of other chips within a data processing system, then the two cycle latency illustrated in FIG. 1C represents the “target” cycle for the transmission and capture of data between chips, such as chip 102 and chip 104. The target cycle is the predetermined cycle at which data is expected by the chip. Interfaces having a shorter latency may need to be padded, in accordance with the prior art, in order to ensure synchronous operation. The padding ensures that faster paths in interface 100 have latencies greater than one bus clock cycle and less than two bus clock cycles, whereby data synchronization may be maintained.
This may be further understood by referring now to FIG. 1D, illustrating a plurality 101 of chips, chips 102, 103 and 104. Chip 102 and chip 104 are coupled on “slow” path 152 having a long latency, TS. Chip 103 is coupled to chip 102 via “fast” path 154 having a short latency period, TF. A “nominal” path coupling plurality 101 of chips 102-105 has latency TM, such as the latency on path 156 between chip 102 and chip 105.
The timing diagram in FIG. 1E provides further detail. FIG. 1E illustrates a timing diagram similar to that in FIG. 1C in which the target cycle for the capture of data into a receiving chip is two bus cycles. In FIG. 1E, the nominal latency, TM, is shown to be 1.5 bus cycles, the fast path latency, T′F, is illustrated to be just greater than one bus cycle, and the slow path latency, TS, is shown to be slightly less than two bus cycles. In this case, each of the plurality of chips 101 in FIG. 1D captures data on the target cycle, two bus cycles after data launch.
If, however, the fast path is shorter, as illustrated by fast path latency TF, data synchronization is lost. In this case, data arrives at chip 103 prior to transition t2 of the chip 103 bus clock, as illustrated by the dotted portion of data 117 at chip 103, and is latched into chip 103 after one bus cycle. This is illustrated by the dotted portion of data 125 in chip 103. In order to restore synchronization, the fast path, path 154, between chips 102 and 103 would require padding to increase the fast path latency, from T′F to TF. Consequently, the timing of such a prior art interface is tuned to a specific operating range and a particular interface length, and is valid only for the technology for which the design was timed and analyzed.
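The capture behavior described for FIG. 1E can be sketched numerically. The following is a minimal illustration only, assuming zero skew between bus clocks and assuming that a destination latch captures data on the first bus clock edge at or after the data arrives. The function names and the sample latency values (expressed in bus cycles) are hypothetical and were chosen merely to match the qualitative description above.

```python
import math

def capture_cycle(latency_cycles):
    """Bus cycle, counted from launch, on which data is captured.

    Assumption: capture occurs on the first bus clock edge at or
    after the data arrives at the destination latch.
    """
    return math.ceil(latency_cycles)

def is_synchronous(latencies, target_cycle):
    """True if every path is captured on the same target cycle."""
    return all(capture_cycle(t) == target_cycle for t in latencies)

# Hypothetical latencies in bus cycles: nominal, fast (just over one
# cycle), and slow (just under two cycles), as in the FIG. 1E case.
paths_ok = [1.5, 1.05, 1.95]
# Same system with a shorter fast path (less than one bus cycle).
paths_bad = [1.5, 0.8, 1.95]

print(is_synchronous(paths_ok, target_cycle=2))
print(is_synchronous(paths_bad, target_cycle=2))
```

Under these assumptions the shorter fast path is captured one bus cycle early, which is the loss of synchronization that padding is intended to correct.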
Likewise, increasing the clock speed of the chips in FIG. 1D will result in a loss of synchronization. This may be understood by considering an explicit example. The local clock cycle time is first taken to have a 1 nanosecond (ns) period. The bus clock will have a period that is a fixed multiple, which will be taken to be two, of the local clock. Let the nominal latency of the interface, TM, be 3 ns with +/−0.99 ns of timing variation, i.e., the best case, or fast path, TF, is approximately 2 ns and the worst case, or slow path, TS, is approximately 4 ns. The data will arrive after two ns and before four ns. Hence the interface will operate under all conditions, i.e., data is guaranteed to arrive after the first bus cycle and before the second bus cycle. However, if the speed of the chips is increased to a 0.9 ns cycle time, the bus cycle time is changed to 1.8 ns. In order to ensure enough time for the data to propagate across the interface under worst case conditions, the data must not be captured before 2.5 bus cycles, or 4.5 ns, because two bus cycles, or 3.6 ns, is less than the slow path time, TS, or 4 ns. Then, in order to operate at a 1.8 ns bus cycle, the earliest the data can arrive is 1.5*1.8=2.7 ns (one bus cycle earlier), to ensure data arrives on the same cycle for all conditions. However, the earliest data can arrive from the above latency numbers is via the fast path with a TF of 3 ns−0.99 ns=2.01 ns. Thus, operating at a bus cycle time of 1.8 ns cannot be supported in a conventional synchronous design. In order to operate synchronously, the bus to processor ratio must be slowed to at least 3:1 and operate at a 2.7 ns cycle time (2.7 ns*1.5 cycles=4.05 ns and 2.7 ns*0.5 cycles=1.35 ns), which militates against the increase in local clock speed.
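The arithmetic of this example can be checked with a short sketch. It assumes, as the figures of 2.5, 1.5 and 0.5 bus cycles in the example suggest, that capture points fall on half bus cycle boundaries and that the valid arrival window is the one bus cycle preceding the capture point. The function name and the use of picoseconds are illustrative choices, not part of the described interface.

```python
def capture_window(bus_ps, t_fast_ps, t_slow_ps):
    """Return (capture_ps, earliest_ps, ok) for one bus period.

    Assumptions (inferred from the worked example, not stated as a
    general rule): capture points lie on half bus cycle boundaries,
    and data must arrive within the one bus cycle before capture.
    """
    # Smallest number of half cycles whose duration covers the slow path.
    half_steps = -(-2 * t_slow_ps // bus_ps)  # integer ceiling division
    capture_ps = half_steps * bus_ps / 2
    earliest_ps = capture_ps - bus_ps
    ok = earliest_ps <= t_fast_ps
    return capture_ps, earliest_ps, ok

T_FAST = 2010  # 3 ns - 0.99 ns, in picoseconds
T_SLOW = 3990  # 3 ns + 0.99 ns, in picoseconds

for bus_ps in (2000, 1800, 2700):
    print(bus_ps, capture_window(bus_ps, T_FAST, T_SLOW))
```

With these inputs the 2 ns and 2.7 ns bus periods satisfy the window, while the 1.8 ns period does not, because the fast path arrives at 2.01 ns, before the 2.7 ns earliest permitted arrival.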
Thus, there is a need in the art for apparatus and methods to accommodate data transfers between chips in a data processing system having increasing clock speeds. In particular, there is a need for methods and apparatus to ensure data synchronization between chips in data processing systems in which path latencies vary over more than one bus cycle, and in which the need for design specific hardware padding is eliminated.
The aforementioned needs are addressed by the present invention. Accordingly, there is provided, in a first form, an interface apparatus. The apparatus includes a first storage device operable for storing a first set of data values, and a second storage device operable for storing a second set of data values. Member data values of the first and second sets of data values may have a first predetermined width, n. The first and second storage devices are operable for latching data on opposite edges of a first clock signal. Circuitry, coupled to the first and second storage devices, is operable for outputting a first data value from the first storage device and a second data value from the second storage device in response to a second clock signal, the first and second data values constituting an output value having a second width, 2n.
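A simplified behavioral model may help in visualizing the first form. The sketch below is not the claimed circuitry; it assumes that the first storage device latches on the rising edge and the second on the falling edge of the first clock, that the second clock is modeled as a call to a read method, and that the first data value occupies the upper n bits of the output. The class and method names are hypothetical.

```python
class TwoLatchInterface:
    """Behavioral sketch: two n-bit latches feeding a 2n-bit output."""

    def __init__(self, n):
        self.n = n
        self.mask = (1 << n) - 1
        self.first = 0   # assumed to latch on the rising edge
        self.second = 0  # assumed to latch on the falling edge

    def first_clock_edge(self, rising, value):
        """Latch an n-bit value on one edge of the first clock."""
        if rising:
            self.first = value & self.mask
        else:
            self.second = value & self.mask

    def read_on_second_clock(self):
        """Output both latched values as one 2n-bit value.

        Assumption: the first value forms the upper n bits.
        """
        return (self.first << self.n) | self.second

iface = TwoLatchInterface(n=4)
iface.first_clock_edge(rising=True, value=0xA)
iface.first_clock_edge(rising=False, value=0x5)
print(hex(iface.read_on_second_clock()))
```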
There is also provided an interface apparatus in an alternative embodiment. The apparatus includes a first plurality of storage devices, each storage device of the plurality being operable for storing a corresponding one of a plurality of sets of data values. Each member data value of the plurality of sets has a first predetermined bit width, n. The first plurality of storage devices stores data values in response to a first clock signal. Selection circuitry, coupled to the plurality of storage devices, is operable for sequentially outputting each corresponding set of data values, in which the data values are received in an input data stream. The circuitry sequentially outputs each corresponding set of data values in response to at least one first control signal. Circuitry, coupled to the plurality of storage devices, is operable for receiving the plurality of sets of data values and sequentially outputting, in response thereto, a set of output data values, each output data value having a predetermined second bit width, m·n. The output data values are output in response to a second clock.
Additionally, there is provided, in a second form, a data processing system. The system includes a first data processing device and a second data processing device coupled to the first data processing device via an elastic interface. The elastic interface contains a first storage device operable for storing a first set of data values, and a second storage device operable for storing a second set of data values. The first and second storage devices are operable for latching data on opposite edges of a first clock signal. Member data values of the first and second sets of data values have a first predetermined width, n. Circuitry, coupled to the first and second storage devices, is operable for outputting a first data value from the first storage device and a second data value from the second storage device in response to a second clock signal, the first and second data values constituting an output value having a second width, 2n.
There is further provided a data processing system in an alternative embodiment. The system includes a first data processing device, and a second data processing device. The first and second devices are coupled via an elastic interface. The interface has a first plurality of storage devices, each storage device of the plurality being operable for storing a corresponding one of a plurality of sets of data values. Each member data value of the plurality of sets has a first predetermined bit width, n. The first plurality of storage devices stores data values in response to a first clock signal. Selection circuitry, coupled to the plurality of storage devices, is operable for sequentially outputting each corresponding set of data values. The data values are received in an input data stream, and the selection circuitry sequentially outputs each corresponding set of data values in response to at least one first control signal. Circuitry, coupled to the plurality of storage devices, is operable for receiving the plurality of sets of data values and sequentially outputting, in response thereto, a set of output data values, each output data value having a predetermined second bit width, m·n, wherein the output data values are output in response to a second clock.
There is also provided, in a third form, a method of interfacing data processing devices. The method includes storing a first plurality of sets of data values in a first plurality of storage elements. Each data value of each of the first plurality of sets is stored for a predetermined time interval relative to a first clock. Each data value is communicated in a data stream between the data processing devices. Also included is selectively sequentially receiving members of the first plurality of data values at a second plurality of storage elements having m storage elements. The members received in the receiving step are stored in corresponding elements of the second plurality of storage elements in response to a second clock. An output of each storage element of the second plurality of storage elements provides an n-bit wide portion of an (m·n)-bit wide output data value.
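The assembly of an (m·n)-bit output from m storage elements can be illustrated with a simplified sketch. It models only the width conversion, not the timing or the holding of data for a predetermined interval. It assumes round-robin filling of the m storage elements, that the first element supplies the most significant n bits, and that a trailing incomplete group is not output. The function name is hypothetical.

```python
def assemble_words(stream, m, n):
    """Group successive n-bit values into (m*n)-bit output values.

    Assumptions: storage elements are filled in round-robin order,
    element 0 supplies the most significant n bits, and an incomplete
    final group produces no output.
    """
    mask = (1 << n) - 1
    elements = [0] * m  # the m storage elements
    words = []
    for i, value in enumerate(stream):
        elements[i % m] = value & mask
        if i % m == m - 1:
            # All m elements hold fresh data: concatenate their outputs.
            word = 0
            for element in elements:
                word = (word << n) | element
            words.append(word)
    return words

words = assemble_words([0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8], m=4, n=4)
print([hex(w) for w in words])
```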