Most computer interconnects serve a limited number of nodes or endpoints. Larger interconnects are typically built up from smaller interconnect modules by joining one interconnect module to another in the form of trees, fat trees, and other networks of switches (known as switched fabrics) configured in a variety of different topologies.
Each switch in such a network may connect to one or more host computers and to one or more storage devices. In addition, there may be switch-to-switch connections and switch-to-concentrator connections. The switch-to-switch connections are typically of higher bandwidth than the switch-to-host or switch-to-storage connections so that data between switches can be distributed to multiple hosts or storage devices. A concentrator, also referred to as a level-2 switch, takes input from one or more switches, forming a bridge between one or more switched fabrics and other devices such as gateways to other data networks. The flow of data in these implementations must be managed internally with respect to data paths, which includes packing messages for switch-to-switch traffic and unpacking such messages for distribution to individual endpoints (host computers or storage devices).
FIG. 1 (Prior Art) illustrates a typical n-way (or n-by-n) interconnect based on broadcast light as described in U.S. Pat. No. 7,970,279 (“N-way serial-channel interconnect”). The figure portrays a fully connected, n-by-n interconnect from the inputs (typically from n nodes or endpoints) to the outputs (typically to the same n nodes or endpoints).
Broadcast distribution module 100 (labeled “DBOI” for direct-broadcast, optical interconnect) distributes information encoded in light (in the preferred embodiment) or other data carrier means from each of the n inputs 110. This broadcast distribution is indicated by the plurality of fan-out and fan-in lines labeled 115. In the preferred embodiment, these lines 115 schematically indicate the distribution of light broadcast from each of the inputs 110 and its collection onto each of the output lines 120. The use of “light” in this description is not meant to restrict the interconnect to optical means; any carrier of information that is capable of being manipulated in the manner indicated by FIG. 1 is valid in the context of FIG. 1. In the optical version of the interconnect, described in the above-referenced patent, the light broadcast from the several inputs is collected by lenses and focused on the outputs 120, which are multi-mode fibers in the optical case and transmission lines or cables in the electrical case. Each of the four collection points (the tails of the arrows 120) contains n signals, so that together they hold four replications of the data streams of the n inputs 110. In the original implementation of the DBOI interconnect, n was 32 and each of the 32 input data streams was split four ways by an optical fan-out device. These optical signals were then combined into four detector arrays or fiber-optic bundles, each containing a copy of the original 32 input data streams. Hence the depiction of four data streams 120 exiting the broadcast distribution module 100.
Note that the number n=32 and the optical fan-out of four were chosen for convenience only. Other choices are possible. For example, a 128-way interconnect might have 16-fold optical or electrical fan-outs leading to 16 output bundles labeled 120 instead of the four depicted in FIG. 1. The partitions 135 are meant to illustrate the four-fold modular structure of this particular embodiment.
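The counting behind these two configurations can be sketched as follows. This is a minimal illustration only; the function name `fanout_counts` and the dictionary keys are hypothetical and do not appear in the referenced patent. It assumes, as described for FIG. 1, that the number of output bundles 120 equals the fan-out and that each bundle carries a copy of all n input streams.

```python
def fanout_counts(n: int, fanout: int) -> dict:
    """Count bundles and replicated signals for an n-way broadcast interconnect.

    Hypothetical helper: assumes one output bundle per fan-out copy, with
    every bundle carrying a copy of each of the n input streams.
    """
    return {
        "bundles": fanout,                       # output bundles labeled 120
        "signals_per_bundle": n,                 # each bundle holds all n streams
        "total_replicated_signals": n * fanout,  # n inputs, each split `fanout` ways
    }


if __name__ == "__main__":
    print(fanout_counts(32, 4))     # the configuration depicted in FIG. 1
    print(fanout_counts(128, 16))   # the 128-way example from the text
```

For n=32 and a four-fold fan-out this gives four bundles of 32 signals each; for the 128-way example it gives 16 bundles of 128 signals each.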
Still referring to FIG. 1, module 130 (labeled “EONIC” for electro-optical node interface controller) receives the optical signals 120 (in the preferred embodiment) and converts them into electrical signals. Each of the n signals in each of the four bundles represented by an arrow 120 is fanned out 8 ways, carrying 8 copies of each original input signal 110, thus giving a total of n=32 distinct signals in each partition. Each of the partitions separated by lines 135 in module 130 contains n/4 outputs or endpoints 140. For example, the top partition in module 130 would contain, for n=32, outputs to nodes 1 through 8; the second partition, outputs 9 through 16; and so on. Thus each input stream 110 is represented at each output stream 140.
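The partitioning and the resulting full connectivity can be checked with a short model. This is a sketch under stated assumptions, not the implementation of the referenced patent: the function names are hypothetical, nodes are numbered from 1 as in the text, and the number of outputs per partition is taken to be n divided by the number of partitions.

```python
def partition_outputs(n: int, partitions: int = 4) -> list:
    """Return the output-node numbers (1-based) served by each partition."""
    if n % partitions != 0:
        raise ValueError("n must be divisible by the number of partitions")
    size = n // partitions  # n/4 outputs per partition in FIG. 1
    return [range(p * size + 1, (p + 1) * size + 1) for p in range(partitions)]


def inputs_seen_at_outputs(n: int, partitions: int = 4) -> dict:
    """Map each output node to the set of input streams that reach it."""
    all_inputs = set(range(1, n + 1))
    seen = {}
    for outputs in partition_outputs(n, partitions):
        bundle = set(all_inputs)      # each bundle 120 carries a copy of every input
        for out in outputs:
            seen[out] = set(bundle)   # electrical fan-out: one copy per output
    return seen


if __name__ == "__main__":
    parts = partition_outputs(32)
    print([(r.start, r.stop - 1) for r in parts])
    seen = inputs_seen_at_outputs(32)
    print(all(s == set(range(1, 33)) for s in seen.values()))
```

With n=32 the first two partitions cover nodes 1 through 8 and 9 through 16, matching the example above, and every output sees all 32 input streams.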
Note particularly that there are no switches or routing mechanisms inside either module 100 or module 130. That is, data are free to flow from any of the n inputs 110 to any of the n outputs 140 without any impediment. The immediate result is that there can be no data congestion within the interconnect represented by FIG. 1.
In practice, module 130 contains additional software and/or hardware to collect, store, and gate the various digital data streams according to encoded destinations, as well as the flow-control circuitry needed to prevent contention at the output nodes 140. These functions, which are additional to the fan-out and fan-in circuits, are described in the above-referenced U.S. Pat. No. 7,970,279.
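The idea of gating broadcast data by encoded destination can be illustrated with a small model. Everything in this sketch is hypothetical: the `Packet` fields, the `OutputGate` class, the bounded queue, and the "busy" signal are assumptions chosen for illustration and are not taken from U.S. Pat. No. 7,970,279, whose actual gating and flow-control mechanisms may differ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    source: int        # hypothetical field: sending node
    destination: int   # hypothetical field: encoded destination node
    payload: bytes


class OutputGate:
    """Hypothetical gate in front of one output node 140."""

    def __init__(self, node_id: int, capacity: int = 4) -> None:
        self.node_id = node_id
        self.capacity = capacity   # assumed bounded store per output
        self.queue = deque()

    def offer(self, packet: Packet) -> str:
        """Handle one broadcast packet arriving at this output."""
        if packet.destination != self.node_id:
            return "ignored"       # broadcast copy not addressed to this node
        if len(self.queue) >= self.capacity:
            return "busy"          # assumed flow-control signal back to the sender
        self.queue.append(packet)
        return "stored"


def broadcast(packet: Packet, gates: list) -> dict:
    """Deliver one packet to every gate, as the broadcast fabric would."""
    return {gate.node_id: gate.offer(packet) for gate in gates}


if __name__ == "__main__":
    gates = [OutputGate(node_id, capacity=1) for node_id in range(1, 5)]
    print(broadcast(Packet(1, 3, b"first"), gates))
    print(broadcast(Packet(2, 3, b"second"), gates))
```

In this model every output receives every packet, but only the addressed output stores it; a second packet to the same full output is reported as "busy" rather than overwriting the first.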
FIG. 2 (Prior Art) illustrates how to separately interconnect four sets of n nodes, each set being fully interconnected within itself. The resulting 4n nodes are, of course, not fully interconnected.
The four modules 200 each have independent inputs 210 of n channels each. As described above, each set is distributed optically (in the preferred embodiment) and presented to the four sets of optical outputs 220 in each interconnect row. The four EONICs 230 receive the four sets of optical inputs and distribute and combine them, as described above, to the four sets of outputs 240. The 4n independent inputs 210 are treated as four separate groups of n, such that a data stream presented to the top module 200, for example, cannot appear on any of the three bottom modules 230. Note that the four sections indicated by the sequence of indicators 200, 210, 220, 230, and 240 are not labeled separately since they are copies of the same n-by-n interconnect.
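The connectivity of the arrangement in FIG. 2 can be expressed as a simple reachability test. This is an illustrative sketch only; the function names and the zero-based node numbering (nodes 0 through 4n-1, with each consecutive block of n nodes attached to one module) are assumptions made for the example.

```python
def group_of(node: int, n: int) -> int:
    """Return the index of the n-by-n interconnect serving this node."""
    return node // n


def can_reach(src: int, dst: int, n: int, groups: int = 4) -> bool:
    """True if data from src can appear at dst in the FIG. 2 arrangement."""
    total = n * groups
    if not (0 <= src < total and 0 <= dst < total):
        raise ValueError("node index out of range")
    # Each module is fully connected internally, but no path joins modules.
    return group_of(src, n) == group_of(dst, n)


if __name__ == "__main__":
    n = 32
    total = 4 * n
    reachable = sum(can_reach(s, d, n) for s in range(total) for d in range(total))
    print(reachable, "of", total * total, "ordered node pairs are connected")
```

For n=32 only 4 × 32 × 32 = 4096 of the 128 × 128 = 16384 ordered node pairs are connected, which is the sense in which the 4n nodes are not fully interconnected.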
Today's computing clusters as envisioned for data centers, cloud computing, and supercomputer applications are meant to serve more than the few dozen nodes or endpoints that can be subsumed by a single switched interconnect. Typical methods of extending interconnects make use of various problematic devices to ensure that each node in a many-node system can be connected to any other node. Note that a given node-to-node connection is not necessarily permanently established, nor can such a connection always be established when desired. For example, the switches and associated routing hardware within these switched networks, and the software controlling them, may become internally blocked by message traffic in competing data paths. In addition to data congestion in a switched network, data must often be passed from switch to switch in discrete hops, so that node-to-node communication takes place in a series of stages, with delay and blocking possible at each stage. Furthermore, the heterogeneous nature of the diverse hardware elements in such a switched fabric adds complications and costs to building and maintaining a data center, computing or storage cloud, or supercomputer cluster.
Heretofore, there has been no approach to interconnecting nodes that obviates the above-discussed deficiencies. What is needed is a better technology to interconnect nodes.