Workers in the arts of making and using information-processing networks are well aware of today's challenge of balancing cost-effective hardware against evolving transmission speeds. This is what this invention addresses.
For instance, today's users increasingly ask for equipment that is both simpler and less expensive.
In considering the design of our next-generation Server, we at first favored "distributed (ring-) architecture" with interconnect based on the IEEE P1596 SCI standard. (SCI: Scalable Coherent Interface; assume "Server" is a collection of "nodes" with supplemental circuitry; i.e., CPUs distributed in rings to effect a single, unified system; a "ringlet" is a min-size ring.)
This standard provides that when data is transferred between nodes, the transfer will be synchronous with a "transmit clock" signal (C.sub.T) which is transmitted with the data and typically generated by its own oscillator (O.sub.1). But in each node, incoming data would have to be resynchronized to a "local clock" (C.sub.L) which is specified to have the same frequency as the "transmit clock" signal (C.sub.T) arriving with the data, but has its own local oscillator O.sub.L. However, due to oscillator variation and thermal changes, the transmit clock (C.sub.T) can be expected to drift with respect to the local clock (C.sub.L). Thus, anytime data is transferred between systems, resynchronization will be necessary; e.g., when data is transferred along each link between nodes it must be synchronized (with the local clock).
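The rate at which the two clocks slip relative to one another can be estimated with simple arithmetic. The following is a minimal sketch only: the function names, the 250 MHz nominal frequency and the +/-100 ppm oscillator tolerance are illustrative assumptions, not values taken from the standard or from the text above.

```python
def cycles_per_slip(tolerance_ppm: float) -> float:
    """Worst-case number of clock cycles before two oscillators, each within
    +/- tolerance_ppm of the same nominal frequency, drift apart by one
    full clock period."""
    # Worst case: one oscillator is fast by the tolerance, the other slow.
    relative_offset = 2.0 * tolerance_ppm * 1e-6
    return 1.0 / relative_offset


def seconds_per_slip(nominal_hz: float, tolerance_ppm: float) -> float:
    """The same quantity expressed in seconds at the nominal frequency."""
    return cycles_per_slip(tolerance_ppm) / nominal_hz


if __name__ == "__main__":
    # Assumed example values (hypothetical, for illustration only).
    print(cycles_per_slip(100.0))          # roughly 5000 cycles
    print(seconds_per_slip(250e6, 100.0))  # roughly 2e-05 seconds
```

Under these assumed numbers the transmit and local clocks would slip by a full period many thousands of times per second, which is why resynchronization is needed on every link rather than occasionally.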
SCI Nodes, Ringlets:
As workers know, each node in an SCI ring, or ringlet, can typically comprise I/O devices, memory and processors. In fact, each node is usually capable of operating as an independent server. However, it is often desirable to have a single unitary system which includes more processors than is practical to place on a single bus; thus, a multi-node server. And, with processor speeds increasing faster than bus speeds, the number of processors per bus is actually decreasing.
SCI is an attempt to solve this problem by creating an interface that can transparently connect multiple servers (multiple nodes) and make them appear to be a single system whose processing power is the sum of the connected systems' processing power. It does this by providing a physical connection for passing data, as well as a logical cache coherence protocol. All the processors in the whole system are "coherent"; i.e., they are directly linked to all data sites, so that SCI has direct access to all and there are no "local copies". This is as opposed to the more usual, non-coherent string of processors in a distributed-architecture system, where a processor in a given node must transmit a "special message" in order to access other nodes. Because the processors are coherent and share a single memory space, each processor is able to access any part of memory or I/O devices across the entire system as though they were local.
This standard SCI approach seems more complex and difficult to implement than it needs to be. My invention is less so, while addressing the same need.
"Delay-line Method":
The SCI standard approach to this situation is to feed each of the incoming data bits, plus a flag bit, through separate tapped delay lines: a "Delay-line Method". Using this approach, it is possible to compare the phases of the incoming (transmit) clock and the local clock, and then choose the appropriate tap to sample the data with respect to (i.e., in sync with) the local clock, thereby always assuring that the data is "valid" when sampled.
As the two clocks change phase with respect to one another, and the selected tap is adjusted, eventually, the selected tap will be either the first or the last tap of the delay line, depending on whether the incoming or local clock is faster. At this point there will be no further room for adjustment and the tap will have to be switched from the end-tap to one of the center-taps where the data is also valid with respect to (in sync with) the local clock. This will make the system either fail to sample a data bit or to sample the same bit twice, depending upon which clock is faster. But, since data is sent in packets, if one skips, or resamples, data only between packets, no harm results.
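The tap-adjustment behavior just described can be sketched as a toy model. This is an illustration under stated assumptions, not the circuit defined by the standard: the class name `TapSelector`, the choice of 16 taps spanning two clock periods (so 8 taps per period), and the one-tap drift step are all hypothetical. The sketch does not model deferring the jump until a gap between packets.

```python
class TapSelector:
    """Toy model of choosing one tap on a tapped delay line."""

    def __init__(self, taps=16, taps_per_period=8, selected=8):
        self.taps = taps                        # stages per delay line
        self.taps_per_period = taps_per_period  # assumed: line spans 2 periods
        self.selected = selected                # start near the middle
        self.jumps = 0                          # end-tap to center-tap switches

    def drift(self, step):
        """Shift the selected tap by +1 or -1 as the clock phases drift.

        Returns True when the tap ran off an end of the line and was moved
        one clock period back toward the center, which corresponds to one
        bit being skipped or sampled twice.
        """
        if step not in (1, -1):
            raise ValueError("step must be +1 or -1")
        nxt = self.selected + step
        jumped = False
        if nxt >= self.taps:    # ran past the last tap
            nxt -= self.taps_per_period
            jumped = True
        elif nxt < 0:           # ran past the first tap
            nxt += self.taps_per_period
            jumped = True
        self.selected = nxt
        if jumped:
            self.jumps += 1
        return jumped


if __name__ == "__main__":
    sel = TapSelector()
    events = [sel.drift(+1) for _ in range(8)]
    print(events, sel.selected, sel.jumps)
```

In this model the jump always lands on a tap exactly one clock period away, where the data is again valid with respect to the local clock; a real receiver would additionally hold off the jump until no packet is in flight.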
Problems:
This method of synchronization places strict demands on the ASIC technology used to implement it. CMOS is disfavored because it tends to have a high spread between the best- and worst-case gate delays over varying conditions of temperature, voltage and process; thus, the preferred implementation of this "Delay-line" method is with an ECL gate array, though this is expensive. Also, a large number of gates is needed. The method requires 17 tapped delay lines, each of which must have a total delay of at least 2 clock periods under best-case conditions (temperature, voltage and process), and each delay line requires 16 or more stages, or taps, each requiring several gates. Laying out and routing these delay lines also places additional constraints on the gate array architecture.
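The scale of this requirement can be made concrete with a rough budget. The sketch below uses the 17 delay lines and 16 stages per line given above; the figure of 4 gates per tap (the text says only "several") and the 2 ns clock period are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def delay_line_budget(lines=17, taps_per_line=16,
                      gates_per_tap=4, clock_period_ns=2.0):
    """Rough gate count and per-tap delay for the delay-line method.

    gates_per_tap and clock_period_ns are assumed values, not figures
    from the standard.
    """
    total_taps = lines * taps_per_line
    total_gates = total_taps * gates_per_tap
    # Each line must span at least 2 clock periods under best-case
    # conditions, so the average best-case delay per tap is at least:
    min_avg_tap_delay_ns = 2.0 * clock_period_ns / taps_per_line
    return {
        "total_taps": total_taps,
        "total_gates": total_gates,
        "min_avg_tap_delay_ns": min_avg_tap_delay_ns,
    }


if __name__ == "__main__":
    print(delay_line_budget())
```

With these assumed inputs the delay lines alone account for 272 taps and on the order of a thousand gates, before any phase-comparison or tap-selection logic is counted.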
In spite of these constraints and expense, there are workers today who propose offering parts designed to this standard and using this Delay-line method.
This invention addresses this problem, teaching the use of a "circular FIFO array".
An object hereof is to address at least some of the foregoing problems and to provide at least some of the mentioned, and other, advantages.