1. Field of the Invention
The invention relates generally to the field of optical interconnects for computer systems and/or their subsystems as well as networks and/or their subsystems. More particularly, the invention relates to a free-space optical interconnect that includes a fan-out and broadcast signal link.
2. Discussion of the Related Art
The concept of parallel-distributed processing (PDP), which is the theory and practice of massively parallel processing machines, predates the first supercomputers of the 1960s. In practice, high-performance parallel-distributed processing machines are difficult to achieve for several interrelated reasons. On the physical side of the equation, interconnections between n processors or nodes increase as the square of the number of processors (n^2); the physical bulk increases as n for the packaging and n^2 for the interconnecting wiring; latency due to capacitance increases as the average distance between nodes, which is also proportional to n; and heat-removal difficulty increases as the square root of the number of processors (n^(1/2)) due to the surface-to-volume ratio. On the logical side of the equation, message overhead is constant for broadcast mode and can increase as n for relay mode. The impact on software is roughly proportional to n^2 due to the increased complexity of parallel-distributed processing algorithms. The overall cost per node increases more rapidly than the number of nodes when all these factors are considered. What is needed is a method of parallel-distributed processing, design and operation that overcomes some or all of these scaling problems.
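The scaling relations listed above can be tabulated in a short sketch. Only the exponents come from the text; the function names are hypothetical, and every proportionality constant is set to 1 purely for illustration.

```python
import math


def scaling_factors(n):
    """Relative size of each cost named in the text for n nodes.

    Assumption: all proportionality constants are 1, so only the
    ratios between two values of n are meaningful.
    """
    return {
        "interconnections": n ** 2,       # grows as n^2
        "packaging_bulk": n,              # grows as n
        "wiring_bulk": n ** 2,            # grows as n^2
        "latency": n,                     # average distance, grows as n
        "heat_removal": math.sqrt(n),     # grows as n^(1/2)
        "relay_message_overhead": n,      # relay mode can grow as n
        "software_impact": n ** 2,        # roughly n^2
    }


def growth_ratio(n_small, n_large):
    """How much each cost grows when going from n_small to n_large nodes."""
    small = scaling_factors(n_small)
    large = scaling_factors(n_large)
    return {name: large[name] / small[name] for name in small}


if __name__ == "__main__":
    for name, ratio in growth_ratio(10, 100).items():
        print(f"{name}: x{ratio:.2f}")
```

Under these assumptions, a tenfold increase in node count multiplies the interconnection and wiring terms by one hundred, while the packaging term grows only tenfold.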
The present record holder in performance is NEC's “Earth Simulator,” topping out at 35.86 teraflops (a teraflop is 1000 gigaflops and a flop is a floating-point operation, while “flops” usually refers to flops per second). While there are many interesting and novel entries in today's supercomputer marathon, the Department of Energy's Advanced Simulation and Computing Initiative (ASCI) has sponsored several of the top contenders. The latest of these is a fifth-generation ASCI system to be built by IBM. The ASCI Purple (AP), if on time and within budget, will arrive by 2005 at a projected cost of approximately $550 per gigaflop with an ultimate option to have a 100-teraflops performance figure in a single machine. (A gigaflop is one billion operations per second.) This is about 12 times the performance of the previous ASCI Q and ASCI White machines. By contrast, a present-day personal computer is typically priced at about $750/GF (the minimum cost is probably about $500/GF, i.e., actually less than that of the ASCI Purple). This clearly shows that economies of scale are marginal to nonexistent given the factor of nearly 13,000 increase in the number of processors required to achieve the 100 teraflop (TF) figure. The ASCI Purple is estimated to weigh in at 197 tons and cover an area of two basketball courts (volume not specified). The AP will have 12,433 Power5 microprocessors, a total memory bandwidth of 156,000 GB/s (gigabytes per second), and approximately 50 terabytes of memory (a terabyte is one million megabytes). Power dissipation will be between 4 and 8 MW (megawatts), counting memory, storage, routing hardware and processors.
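The economy-of-scale comparison above reduces to two divisions, sketched below. The constants are the figures quoted in the text; the constant and function names are hypothetical, and the per-processor figure is simple arithmetic on those numbers, not a published specification.

```python
# Figures quoted in the text (projections, not measured values).
AP_COST_PER_GF = 550.0    # dollars per gigaflop, ASCI Purple
PC_COST_PER_GF = 750.0    # dollars per gigaflop, typical personal computer
AP_PROCESSORS = 12_433    # Power5 microprocessors in ASCI Purple
AP_PEAK_GF = 100_000.0    # 100 teraflops expressed in gigaflops


def cost_ratio(large_machine_cost_per_gf, small_machine_cost_per_gf):
    """Cost per gigaflop of the large machine relative to the small one."""
    return large_machine_cost_per_gf / small_machine_cost_per_gf


def gigaflops_per_processor(peak_gf, processors):
    """Average performance contributed by each processor."""
    return peak_gf / processors


if __name__ == "__main__":
    ratio = cost_ratio(AP_COST_PER_GF, PC_COST_PER_GF)
    per_cpu = gigaflops_per_processor(AP_PEAK_GF, AP_PROCESSORS)
    print(f"cost per GF relative to a PC: {ratio:.2f}")
    print(f"gigaflops per processor: {per_cpu:.2f}")
```

With the quoted figures, the cost per gigaflop falls by only about a quarter relative to the typical personal computer, even though the processor count rises by a factor of more than twelve thousand.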
IBM's Blue Gene/L (BGL), based on that company's system-on-chip (SOC) technology, will take up four times less space and consume about five times less power; it is expected to perform at the 300 to 400 teraflops level. The cost per gigaflop, at about $600/GF, will be about the same as the figures above. Each of the 65,000 nodes in the BGL will contain two Power PCs, four floating-point units, 8 Mbytes of embedded DRAM, a memory controller, support for gigabit Ethernet, and three interconnect modules. The total number of transistors is expected to be around 5 million, making for a large, expensive, and relatively power-hungry node. The interconnect topology is that of a torus, where each node directly connects to six neighbors. For synchronizing all nodes in the system, hardware called a “broadcast tree” is necessary. Establishing broadcast mode to begin a computation, for example, will require several microseconds. To round out the hardware complement of a node, nine memory chips with connectors (for a total of 256 Mbytes) are foreseen. Four nodes will be placed on a 4 by 2-inch printed-circuit card.
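Six direct neighbors per node is consistent with a three-dimensional torus, which the following sketch assumes. The function name and the grid dimensions used in the example are hypothetical; the text does not give the dimensions of the BGL torus.

```python
def torus_neighbors(node, dims):
    """Return the six direct neighbors of `node` in a 3-D torus.

    node: (x, y, z) coordinates of the node.
    dims: (nx, ny, nz) size of the torus along each axis.
    Coordinates wrap around at the edges, which is what makes the
    grid a torus. Each dimension must be at least 3 for the six
    neighbors to be distinct.
    """
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


if __name__ == "__main__":
    # Hypothetical 4 x 4 x 4 torus; the corner node wraps to the far faces.
    for neighbor in torus_neighbors((0, 0, 0), (4, 4, 4)):
        print(neighbor)
```

Because every message to a non-neighbor must be relayed through intermediate nodes, a machine-wide broadcast in such a topology needs either many hops or separate hardware such as the broadcast tree mentioned above.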
Reliability in these existing machines is a major concern when there are from hundreds of thousands to millions of material interconnections (e.g., wires, connectors, solder joints, contact bonding). What is needed is an approach to supercomputer design that increases reliability.
Moreover, the main, unsolved problem facing today's supercomputers is how to achieve the economies of scale found elsewhere in the industrial world. Machines with tens of thousands of processors cost as much per gigaflop as commodity PCs having only a single processor. Part of the reason for this lack of progress in supercomputer scaling is that the interconnect problem has not yet found a satisfactory solution. Adopting present solutions leads to a reliance on slow and bulky, off-chip hardware to carry the message traffic between processors. A related problem is that communication delays increase as the number of nodes increases, meaning that the law of diminishing returns soon sets in. This issue drives the industry to faster and faster processing nodes to compensate for the communications bottleneck. However, using faster and more powerful nodes increases both the cost per node and the overall power consumption. Smaller, slower, and smarter processors could be effectively used if the communications problem were to be solved in a more reasonable fashion.
Broadcasting is an essential feature of parallel computer interconnects. It is used for synchronization, and is intrinsic to many types of calculations and applications, including memory system coherency control and virtual memory. Many applications running on today's supercomputers were written decades ago for relatively small parallel computers that had good bandwidth for broadcasting. These programs run poorly on today's massively parallel machines. The commonly used interconnects based on crossbars and fat trees, as well as all existing parallel computers with n interconnecting nodes, consume n channels of bandwidth during broadcasting, so the per-port and bisection bandwidths do not change substantially when broadcasting.
Massively parallel high-performance computers using fat-tree and crossbar interconnects suffer from a mismatch with the software requirement for non-blocking broadcast of short messages. Two of the most common network functions, Allreduce and Sync, simultaneously broadcast one-word messages. Such broadcasts use excessive bandwidth in fat-tree interconnects, which results in poor system performance. Another function, termed all-to-all communication, in which each computing node in a supercomputer frequently needs to communicate with all other nodes during the course of a computation, is an essential functional capability of any modern interconnect scheme. Additionally, these all-to-all messages are typically short, being a few bytes in length. Frequently used algorithms requiring the all-to-all function include parallel versions of matrix transpose and inversion, Fourier transforms, and sorting. The most effective way to implement the all-to-all function is to base it on a true broadcast capability. Present systems can broadcast information, but only by simulating the broadcast function; thus their implementation of the all-to-all function is inefficient.
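The advantage of a true broadcast for the all-to-all function can be illustrated by counting transmissions. The sketch below assumes that every node must deliver one short message to every other node, and that a simulated broadcast is carried out as separate point-to-point sends; real systems may simulate broadcast in other ways (for example, with relay trees), so the counts are illustrative only. The function name is hypothetical.

```python
def all_to_all_transmissions(n, true_broadcast):
    """Number of transmissions for one all-to-all exchange among n nodes.

    true_broadcast=True : each node transmits once and all others receive.
    true_broadcast=False: each node sends a separate message to each of
                          the other n - 1 nodes (assumed simulation).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n if true_broadcast else n * (n - 1)


if __name__ == "__main__":
    for n in (8, 1000, 65000):
        with_broadcast = all_to_all_transmissions(n, True)
        simulated = all_to_all_transmissions(n, False)
        print(f"n={n}: broadcast={with_broadcast}, simulated={simulated}")
```

Under these assumptions, the transmission count grows as n with a true broadcast and as n^2 without one.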
A poor solution to the interconnect problem leads one directly to the general assumption that the most powerful processors available should be crammed into each node to achieve good supercomputer performance, thus hiding the problems inherent in the interconnect by faster processors and higher channel bandwidth. A compromise is possible if some of these other issues are more effectively resolved. The compromise based on a more suitable interconnect would make use of processors not quite on the leading edge of integration and performance to create a supercomputer of lower cost and power consumption with just as great, or more, overall capability. Of course, nothing prevents one from using the ultra-performance processors as nodes in the proposed systems; both cost and capability would rise significantly.
Today's supercomputer architecture at most makes use of 8-way multithreading, meaning that there is hardware support for up to 8 independent program threads. Any multitasking to be found is handled by software. While theoretically alleviating the communications bottleneck problem and helping to overcome data-dependency issues, the cure is literally worse than the disease, since the nodes now spend more time managing the system's tasks in software than is gained by decomposing complex programs into tasks in the first place. What is needed is a scalable and cost-effective approach to supercomputers that range in size from a briefcase to a small office building, and in performance from a few teraflops to a few petaflops. (A petaflop is 1000 teraflops.)
Interconnect schemes today are invariably based on material busses and crossbars. As data rates increase and data processors become faster, electrical communication between data-processing nodes becomes more power intensive and expensive. As the number of processing nodes communicating within a system increases, electrical communication becomes slower due to increased distance and capacitance, and more cumbersome due to the geometric increase in the number of wires and in the volume, mass, and power consumption of the crossbar. Electrical interconnects are reaching their limit of applicability. As speed requirements increase to match the capacity of ever faster processors for handling data, faster electrical interconnects should be based on controlled-impedance transmission lines whose terminations increase power consumption. Even the use of microstrip lines is only a partial solution as, in any fully-connected system, such lines should cross (in different board layers). Close proximity of communication channels produces crosstalk, which is perceived as noise on adjacent channels. Neither of these problems occurs in a light-based interconnect.
Optical interconnects, long recognized to be the ideal solution, are still in the experimental stage, with practical optical systems connecting only a handful of processors. The main problem with today's optical solutions is conceptual: they are trying to solve a more complicated problem than necessary. This restrictive view has its origins in a limited version of a task or thread: if CPU overhead is required to switch from a computational task to a communications task every time a message arrives, any conceivable computation spread across a multiprocessor system will soon be spending almost all of its time on switching overhead. The way around this untenable situation is to create literal, point-to-point connections as is done for the Hypercube™ and Manhattan architectures such as the Transputer™. Thus, the source and destination of every message are determined by hard-wired connections. This idea is carried over into optical schemes where there is an emitter dedicated to every receiver and a single receiver for every emitter. For an optical system serving hundreds of thousands of nodes, the mechanical alignment is an insurmountable nightmare.
Over the years, a number of universities and private and government laboratories have investigated free-space optical interconnect (FSOI) methods for multiprocessor computing, communications switching, database searching, and other specific applications. The bulk of the research and implementation of FSOI has been in finding ways to achieve point-to-point communications with narrow beams of light from multiple arrays of emitters, typically narrow-beam lasers, and multiple arrays of photoreceivers. The development of vertical-cavity, surface-emitting lasers (VCSELs) and integrated arrays of VCSELs has been the main impetus behind research in the narrow-beam FSOI area. The main problems to overcome with FSOI are alignment, where each laser must hit a specific receiver, and mechanical robustness. U.S. Pat. No. 6,509,992 specifically addresses the problem of misalignment and robustness by disclosing a system of redundant optical paths. When misalignment is detected by a channel-monitoring device, an alternate path is chosen.
Both unfolded configurations, where an array of emitters transmits light across a space to an array of receivers, and folded configurations, where the emitters and receivers lie in the same plane, have been attempted. Most FSOI methods lack direct broadcast capability due to the one-emitter, one-receiver assumption.
Point-to-point optical communication, wherein a narrowly focused laser beam communicates information to a single receiver, represents the extreme case of an optical fan-out of one. A variation is to split a narrowly focused laser beam using one or more beam splitters, each splitting producing two beams from the original. In this way, a single narrow beam can be split into 2^j beams by j successive stages of beam splitting, achieving an optical fan-out of a single narrow beam into multiple narrow, but weaker, beams. However, since the receivers are typically small devices, perhaps a tenth of a millimeter in diameter, it is difficult to achieve and maintain optical alignment of the narrow laser beam onto one or more receivers across all but the smallest distances.
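The trade-off between fan-out and beam strength can be sketched numerically, treating j as the number of splitting stages. The sketch assumes ideal, lossless splitters that divide power equally; neither assumption is stated in the text, and the function name and input power are hypothetical.

```python
import math


def split_fanout(input_power_mw, stages):
    """Fan-out and per-beam power after `stages` stages of beam splitting.

    Assumes each stage splits every beam into two equal halves with no
    loss. Returns (number_of_beams, power_per_beam_mw, splitting_loss_db).
    """
    if stages < 0:
        raise ValueError("stages must be non-negative")
    beams = 2 ** stages
    power_per_beam = input_power_mw / beams
    loss_db = 10.0 * math.log10(beams)  # about 3 dB per stage
    return beams, power_per_beam, loss_db


if __name__ == "__main__":
    # Hypothetical 1 mW source.
    for j in range(5):
        beams, power, loss = split_fanout(1.0, j)
        print(f"j={j}: {beams} beams, {power:.4f} mW each, {loss:.1f} dB loss")
```

Each additional stage doubles the fan-out but halves the power reaching each receiver, which is why the resulting beams are described as weaker.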
A similar method of fan-out has been achieved by use of a diffractive element such as a hologram that splits a single beam into a multiplicity of beams. U.S. Pat. No. 6,452,700 discloses an FSOI backplane based on holographic optical elements mounted on an expansion card. This approach also suffers from sensitivity to alignment, which is compounded by the temperature sensitivity of the hologram material that affects the size of the fan-out pattern. In a typical implementation of a four-node, point-to-point optical interconnect whose linear dimensions are approximately 100 mm, the constraint on angular alignment of the narrow beam is 1/20th of a degree. The severity of this constraint increases linearly with the size of the interconnect.
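The order of magnitude of the angular constraint can be checked with simple geometry. The sketch below assumes that the tolerance is the angle subtended by a receiver about a tenth of a millimeter in diameter, viewed from the emitter across the full path length; this model is an illustrative assumption and not necessarily the derivation behind the 1/20th-degree figure. The function name is hypothetical.

```python
import math


def angular_tolerance_deg(receiver_diameter_mm, path_length_mm):
    """Angle (in degrees) subtended by the receiver at the emitter.

    A beam steered outside this angle misses the receiver entirely,
    under the simplifying assumption of an ideally narrow beam.
    """
    if receiver_diameter_mm <= 0 or path_length_mm <= 0:
        raise ValueError("dimensions must be positive")
    return math.degrees(math.atan(receiver_diameter_mm / path_length_mm))


if __name__ == "__main__":
    for length_mm in (100.0, 200.0, 1000.0):
        tolerance = angular_tolerance_deg(0.1, length_mm)
        print(f"path {length_mm:.0f} mm: tolerance {tolerance:.4f} degrees")
```

For a 0.1 mm receiver at 100 mm this model gives roughly 0.057 degrees, the same order as 1/20th of a degree, and the tolerance shrinks in inverse proportion to the path length.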
What is needed is a cost-effectively scalable approach to optical interconnection that is not sensitive to alignment issues.