1. Field of the Invention
The present invention relates to computer systems and fiber-optic communication devices, and more particularly to wavelength- or frequency-division multiple-access interconnection of nodes in massively parallel processor networks according to local or express data-traffic routing.
2. Description of Related Art
A key barrier to higher performance levels in massively parallel processors (MPPs) is the communication limits that exist amongst the individual processors and between the processors and main memory. Such communication limits include message latencies that could be reduced, e.g., by increasing bandwidth. The time delays between initial message transmission and reception stem from the use of information packets that are relayed many times, e.g., in a bucket-brigade fashion from node-to-node within a communication fabric. At each such node, the packet address header is read to route each message packet appropriately to its intended destination. If this occurs more than once, unnecessary latency in the delivery of the message packet is added and can stall processors waiting for the data. Performance suffers when the processors are starved of needed data. Too narrow a data bandwidth can also degrade performance by forcing the data needed by the processor to be broken into more than one packet. The processor cannot continue until all the required packets are received.
Multiprocessing is of great current interest for general HPC applications, massively parallel processing, and integrated sensor/processor systems. Increases in system node count, computing power per node, and/or sensor-generated data rate increase the communication required to maintain a balanced system that fully utilizes available computing power and sensor data. Traditional electronic solutions are not keeping pace with advances in processor performance and sensor complexity and have increasing difficulty providing sufficient communication bandwidth. The trend towards shared memory (away from message passing) in multiprocessors places additional stress on interprocessor communications due to the short messages and rapid memory access associated with cache-to-cache coherence traffic.
Remote hyper-spectral sensing allows potential threats to be assessed with imaging techniques that add spectral information as a third axis. Three-dimensional data sets comprise layers of two-dimensional image pixels, with each layer representing a different spectral window. If each image layer contains 10^3 × 10^3 pixels, each twelve bits deep, and there are 10^3 layers, one for each spectral bin, each cube would represent 1.2 × 10^10 bits of data. Thus, hyper-spectral sensing can generate one data cube per second, or a sensor data flow of 1.2 × 10^10 bits per second, enough to overwhelm communications, storage, and analysts' resources. The information flow can be reduced by down-selecting data with artificial-intelligence processors at each stage of a surveillance operation, e.g., using on-platform mathematical image-transform filtering. However, such "smart" techniques require sophisticated data-processing capabilities at the remote sensor platform.
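The data-cube arithmetic above can be verified with a short sketch; all values are taken directly from the text:

```python
# Hyper-spectral data-cube size, using the figures stated in the text.
pixels_per_layer = 10**3 * 10**3     # 1000 x 1000 image pixels per layer
bits_per_pixel = 12                  # each pixel is 12 bits deep
layers = 10**3                       # one layer per spectral bin

bits_per_cube = pixels_per_layer * bits_per_pixel * layers
print(bits_per_cube)                 # 12000000000, i.e. 1.2e10 bits

cubes_per_second = 1                 # one cube generated per second
sensor_bits_per_second = bits_per_cube * cubes_per_second
print(sensor_bits_per_second)        # 12000000000 bits/s
```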
Fourier-transform techniques are conventionally used to "look" for high spatial-frequency, e.g., localized, events in each spectral bin. The number of floating-point operations required to perform a two-dimensional Fourier transform at every spectral slice of the data cube is ~6N^3 log2 N, where N is the number of elements along the edge of the data cube, i.e., N^3 is the number of data elements in the cube. Since N = 1000, about 6 × 10^10 floating-point operations must be done on each cube, e.g., 6 × 10^10 operations/cube times one data cube/s = 60 gigaflops, which indicates a multiprocessing approach in which many microprocessors are ganged together in parallel. However, the communication traffic generated, e.g., sensor/processor, processor/processor, and processor/memory, would be about 60 gigabytes/s in a balanced computing environment. This is far beyond the capabilities of prior-art electronic busses.
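The flop-count estimate above follows directly from the ~6N^3 log2 N formula; a minimal check, using only values stated in the text:

```python
import math

N = 1000                                    # elements along each edge of the cube
flops_per_cube = 6 * N**3 * math.log2(N)    # ~6 N^3 log2 N for 2-D FFTs over all slices
# log2(1000) is about 9.97, so flops_per_cube is about 6e10

cubes_per_second = 1
sustained_flops = flops_per_cube * cubes_per_second
print(f"{sustained_flops:.2e}")             # roughly 6e10, i.e. ~60 gigaflops
```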
Two basic approaches exist to increase a computer system's processing capability, e.g., run the clocks faster and/or do more in each clock period. The original Intel microprocessors had four-bit and eight-bit instruction words that were clocked at well under one megahertz. Today, individual Intel PENTIUM PRO microprocessors carry thirty-two and sixty-four bit instruction words that are clocked well over one hundred megahertz.
Parallel processing is an obvious way to increase the processing that occurs in each clock period. Super-micros, for example, have been connected by Intel and other prior-art researchers into MPP networks for specialized applications. The processors communicate amongst themselves over network nodes that carry both local traffic and data earmarked for other regions of the network and for input-output (I/O). But the performance improvement provided by putting more processors in parallel falls off, due to latency and bandwidth limitations, as systems are scaled up beyond a hundred nodes. Each processor spends more time waiting for data the more the parallel system is scaled up. Such problems have been encountered by the Cray Research torus program with three-dimensional interwoven rings, the Intel Paragon mesh program with two-dimensional rings without wraparound, and the Convex Exemplar program where the symmetric multiprocessor (SMP) groups are on parallel rings.
William J. Dally describes the use of express channels in such systems in U.S. Pat. No. 5,367,642, issued Nov. 22, 1994, and U.S. Pat. No. 5,475,857, issued Dec. 12, 1995, both incorporated herein by reference. The express channels "serve as parallel alternative paths to local channels between non-local nodes of the network." See, Dally, William J., "Express Cubes: Improving the Performance of K-ary N-cube Interconnection Networks," VLSI Memorandum 89-564, Massachusetts Institute of Technology, Laboratory for Computer Science, October 1989. The object is to increase system throughput and reduce latency by eliminating some of the needless data congestion at the nodes. The method is analogous to the use of express trains and busses that carry commuters into the city core from outlying regions. The separation of local commuters from long-distance commuters makes for a more efficient transportation system by reducing congestion.
Dally describes an interconnection network of an array of nodes, where each node in the array is capable of routing messages. Immediately adjacent nodes are connected to each other by local channels. Messages traveling from a source node to a destination node travel through local channels and through intermediate nodes interconnected by local channels between the nodes. The local channels may comprise duplex pairs of unidirectional channels with a separate unidirectional channel for carrying messages to a given node, as well as a separate unidirectional channel for carrying messages from the given node. "Express channels" are included that run in parallel with the local and intermediate channels. Such provide an alternative message path between the source nodes and the destination nodes.
Each express channel provides a path between pairs of more separated nodes that bypass the local traffic in the intermediate nodes. As such, messages traveling on the express channels are not incrementally delayed by each of the nodes between the source nodes and the destination nodes. The interconnection network further includes interchanges for interfacing the local channels with the express channels so that messages may travel over either the local channel or the express channel. Such an interconnection network is particularly well suited for a "k-ary, n-cube" topology.
In the simplest embodiment described, only a single express channel is used for any given row of an interconnection network. However, the use of additional express channels is generally preferred by Dally. The interconnection network nodes may comprise processors as well as memory, and the processors may include private memory. The interchange points are situated periodically throughout the interconnection network.
A hierarchical interchange organization is supposedly well suited for use with multiple express channels. In one hierarchical interchange organization, a first interchange interfaces a first of the express channels with the local channels, and a second interchange interfaces the second of the express channels with the local channels. Other hierarchical interchange configurations include more than two levels of express channels. Additional interchanges may be included to interface the multiple express channels with each other. Hierarchical interchanges may be positioned in a stepwise fashion so that messages can bubble up to a top level express channel and then descend back down to a bottom local channel level, e.g., to maximize efficiency. The benefit of such hierarchical organization is that the distance component of latency only increases logarithmically with increasing distance. Still further, the express channels may be provided in multiple dimensions. For instance, express channels may be provided for linear arrays of nodes oriented in each of the multiple dimensions.
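A hypothetical hop-count model (not taken from the cited patents; the interchange spacing is an assumed parameter) illustrates why express channels reduce the distance component of latency: a message takes local hops only to the nearest interchange, rides the express channel past intermediate nodes, then descends back to the local level.

```python
# Illustrative single-level express-channel hop model. All parameters are
# assumptions for the sketch, not figures from the Dally patents.

def local_hops(distance):
    """Without express channels, every intervening node is traversed."""
    return distance

def express_hops(distance, spacing):
    """With express channels: express jumps between interchanges plus the
    leftover local hops. `spacing` = assumed nodes between interchanges."""
    express = distance // spacing    # long jumps that bypass local traffic
    local = distance % spacing       # remaining local hops at the ends
    return express + local

print(local_hops(64))        # 64 node traversals on local channels alone
print(express_hops(64, 8))   # 8 express hops, 0 local
print(express_hops(60, 8))   # 7 express hops + 4 local hops = 11
```

With hierarchical express levels, each level divides the remaining distance again, which is why the patent's distance component grows only logarithmically.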
Dally observes that low-dimensional k-ary n-cube interconnection networks have node delays that dominate their wire delays. For any message sent from a starting node to a destination node, the total delay the message experiences is due primarily to the delays incurred in traveling through intermediate nodes, rather than the delays incurred in traveling over wire channels. An ideal network could transfer messages at close to the speed of light. Unfortunately, low-dimensional (n=2 or n=3) k-ary n-cube interconnection networks in real systems have a distance-related component of latency corresponding to an effective propagation speed more than an order of magnitude slower than the speed of light. Low-dimensional k-ary n-cube interconnection networks also have channel widths that are limited by the node pin count, rather than by wire density. The channel width of such networks can in principle be limited by wire density, but as a practical matter, the pin density and pin count primarily limit the channel width.
The ratio of node delay to wire delay and the ratio of pin density to wire density cannot generally be balanced in ordinary k-ary n-cube networks. By adding express channels, the wire length and wire density can be adjusted independently of the choice of radix (k), dimension (n), and channel width (w), e.g., a so-called "express cube". In general, the wire length of the express channels is increased to the point where the wire delays dominate the node delays and the latency approaches its optimal limit. The number of express channels is adjusted to increase throughput until the available wiring media are saturated.
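The balance described above can be sketched with a simple two-term latency model, total latency = (hops × per-node delay) + (wire length / propagation speed). The per-node delay and wire speed below are assumed, illustrative values, not figures from the patents:

```python
# Illustrative node-delay vs. wire-delay balance. Both constants are assumptions.
T_NODE = 20e-9    # assumed per-node routing delay: 20 ns
V_WIRE = 2e8      # assumed signal speed in wire: ~2/3 the speed of light, m/s

def latency(hops, total_wire_m):
    """Total message latency: routing delay at each hop plus wire propagation."""
    return hops * T_NODE + total_wire_m / V_WIRE

# Same 16 m physical span, crossed via 64 local hops or 8 express hops:
print(latency(64, 16.0))   # node delay dominates (1.28 us of 1.36 us total)
print(latency(8, 16.0))    # fewer hops: wire delay is now a large share
```

Lengthening the express channels (fewer, longer hops) is exactly the adjustment that moves the total toward the wire-delay-dominated, near-optimal limit.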
The use of wavelength division multiplexing in communication networks is not new. Wavelength division multiplexing has conventionally been used to increase network capacity and bandwidth, to allocate bandwidth, and even to route information. Wavelength division multiplexing and free-space optics have been used to interconnect circuit boards within a computer. Some researchers have proposed a wavelength-reuse scheme that enables larger asynchronous transfer mode (ATM) networks. None of such previous approaches has both reduced data latencies and increased message bandwidth.
The prior art has not suggested a very practical way to implement express channels and express cubes. The use of ordinary wire interconnects in the electronic embodiments of Dally's topologies prevents the use of long-distance express channels because high data rates cannot be supported. Every express channel link adds an additional electrical cable assembly to the system, and serious cost and mechanical design and layout difficulties are thereby encountered.
An information source must provide sufficient power to transmit to many destinations simultaneously, because optical receivers will not produce error-free outputs unless they receive strong optical signals. When there are many destinations, a large amount of power can be required. In response, system designers can lower the power delivered to each destination, prune the number of destinations, or increase the transmitter power. None of these options is particularly desirable. Reducing the power received by each destination reduces the data transmission rates that are possible, which slows down communication and the message throughput. Increasing the optical power is not always practical, because the devices used have maximum power limits, or "eye-safe" laser power levels may be exceeded. Reducing the system size to have fewer nodes is incongruous when the object is to build large, parallel processing machines in which computational performance scales with the number of nodes. Architectures that use n-to-n broadcast, n-to-n star couplers, or n-to-1 combining in the optical domain suffer from power inefficiencies of 1/n.
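The 1/n inefficiency above can be expressed as a splitting loss of 10·log10(n) decibels. A minimal power-budget sketch, with an assumed (illustrative) 1 mW transmitter:

```python
import math

# Splitting one transmitter's output among n receivers divides the optical
# power by n; in decibels that is a 10*log10(n) loss.
def split_loss_db(n):
    return 10 * math.log10(n)

tx_dbm = 0.0    # assumed transmitter power: 1 mW = 0 dBm (illustrative)
for n in (2, 64, 1024):
    rx_dbm = tx_dbm - split_loss_db(n)
    print(f"n={n}: received power per node = {rx_dbm:.1f} dBm")
```

At n = 1024 the split alone costs about 30 dB, which is why broadcast architectures either starve each receiver or demand impractically high transmitter power.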
The hardware design is complicated as more wavelengths are required to be emitted from each node in a system. Wavelength tunable lasers presently provide, at most, sixty-four different wavelengths. Multiple sources, each at a fixed wavelength, are needed to generate large numbers of wavelengths. But such multiplicity necessitates complex electronics and associated electronic packaging to build the transmitters involved.
A related issue is the number of wavelengths used in the system. Conventional telecom systems typically use only four wavelengths and are expected to grow to as many as thirty-two over time. If the system requires many wavelengths to support many nodes, it is unattractive because the system size, e.g., node count, will be limited by wavelength division multiplexing technology. If each node must receive all system wavelengths, and there are many system wavelengths, the cost and size of the opto-electronic receivers becomes a problem. In addition, interfacing the electronic output of all receivers to the node is a costly, complicated electronic problem for more than four receivers.
Where centralized controllers are needed to establish communication paths, e.g., to make sure two messages do not interfere with one another, the operation of the controller is unacceptably slow. A centralized controller requires information about transmissions occurring non-locally in the interconnect, although it need not be a physically centralized device. The slow speed results from the need to gather information about transmission requests from all nodes, process this information, and then redistribute it to the nodes; it simply takes time to set up the circuits. Such control is also complex and adds cost.
Architectures based on distributed, all-optical space switches or tunable wavelength switches require centralized control, because no logic or buffering is done within the switch fabric. The slow tuning of many wavelength switches can burden the system with delays that can exceed a microsecond.
Some schemes require that all messages be launched at the same time, to make sure that certain kinds of messages never coexist simultaneously on the same fiber and wavelength. Such prevents any two messages from interfering with one another. Global synchronization is difficult due to having to maintain accurate timing across a large system. Delays tend to vary, resulting in desynchronization.
Other schemes do not guarantee to prevent interference between messages. It is assumed that "collisions," which garble messages occurring on the same wavelength, spatial position, and time slot, are detected, and that the messages can be re-transmitted. This is undesirable because it complicates system management (collision-detection hardware is required), it increases communication delay when many messages exist (due to re-transmission after collisions), and it reduces the total throughput of information through the interconnect, typically by factors of two to three. Even holding the degradation to such factors of two-to-three usually requires global synchronization; without it, even greater degradation occurs.
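The quoted factor-of-two-to-three loss is consistent with classical random-access analysis: slotted ALOHA, the standard model of a synchronized collision channel, peaks at a channel utilization of 1/e. The sketch below evaluates the textbook throughput formula S = G·e^(−G) for offered load G (a standard result, not taken from the cited patents):

```python
import math

# Slotted-ALOHA throughput: S = G * exp(-G), where G is the offered load
# (mean transmission attempts per slot) and S is successful deliveries per slot.
def slotted_aloha_throughput(G):
    return G * math.exp(-G)

peak = slotted_aloha_throughput(1.0)   # maximum occurs at G = 1
print(peak)   # about 0.368, i.e. roughly a factor-of-2.7 throughput loss
```

An unslotted (unsynchronized) collision channel peaks at only 1/(2e), which is why dropping global synchronization degrades throughput even further.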
Architectures using optical broadcast require either centrally controlled bus arbitration or globally synchronized, pre-allocated transmission times, or else they cannot guarantee delivery.
Sasayama et al. describe, in U.S. Pat. No. 5,506,712, a time-slotted, synchronized wavelength division multiplexing approach to connect each of m inputs to some number of outputs. It requires one system wavelength for every input port, e.g., each tunable frequency converter means on each input highway assigns mutually different frequency channels to the optical signals on each highway. This is undesirable, because the number of frequency channels is likely to be limited by practical constraints, for example, the stability of source and multiplexer components. Such, in turn, limits the number of inputs to the system. Sasayama et al. require an m-wavelength tunable source at each of the m switch inputs. Such components are currently research curiosities and are not commercially available, and inexpensive components of this type are unlikely to become available in the near future. Thus, the large number of difficult-to-obtain components in the system is undesirable because it adds significant cost. An additional disadvantage of this patent is the requirement for time-slotting. Each message on every input is transmitted in synchronization with all other messages, to ensure that no two messages are broadcast on the same wavelength in the same timeslot. This requires global synchronization, which adds complexity, and maintaining timing synchronization is difficult in a distributed system.
Charles Husbands describes, in U.S. Pat. No. 5,446,572, a broadcast architecture in which the optical power is broadcast from each transmitter into a common channel connected to every receiver in the system. Such combining reduces the power available to each connection by 1/n, where n is the number of wavelength division multiplexers being combined. So a great deal of optical power is required from each transmitter to begin with, and the transmitter power must be increased with each transmitter/receiver node added to the system. High levels of optical power reduce reliability, increase power consumption, and can prevent the system from being "eye safe" for maintenance personnel. But reducing the overall power as the number of nodes increases forces lower bit rates, because the receiver sensitivity requirements for error-free operation at high bit rates will no longer be met.
For large numbers of nodes, it is difficult to build an n:1 combiner. It would be better to guide each optical output to a single destination to make the best use of the optical power, e.g., using simple 2:1 combining and dividing wavelength-selective elements. Requiring n wavelength sources at each of the n system transmitters means a very large number of sources (n × n) are needed in a system. This adds cost, reduces reliability, and requires substantial electronic circuitry to address each wavelength source independently.
Sotom describes, in U.S. Pat. No. 5,485,297, an optical switch that uses tunable wavelength division multiplexing sources, plus optical switch matrices and star couplers, to route wavelength division multiplexing transmissions to a particular destination. The purpose of the switches is to minimize the size of the star coupler, to improve optical power utilization, and to minimize the number of system wavelengths required by routing messages on the same wavelength to different star couplers. The disadvantage of this approach is the need for a centralized control that analyzes the traffic pattern of the inputs and then sets all the switches to make sure two signals on the same wavelength never go to the same star. This kind of centralized control is slow, complex, and costly.
Sharony et al. describe, in U.S. Pat. No. 5,495,356, a time-slotted approach that requires global synchronization. Optical space switches, e.g., the photonic switches in FIG. 4, or wavelength switches are used for wavelength-selective switching. Centralized control is needed to operate such switches. Sharony et al. also use 1:n splitting, which is power-inefficient, and the approach is limited by switch tuning times.
H. Obara and Y. Hamazumi, in "Star coupler based wavelength division multiplexer switch employing tunable devices with reduced tunability range," Electronics Letters, Jun. 18, 1992, Vol. 28, No. 13, pp. 1268-1270, describe a star-coupler broadcast that is power-inefficient. A set of tunable laser diodes is used, corresponding to one tunable laser per switch input. Fewer tunable components, about one for every four nodes, are preferred for improved cost and reliability. The architecture described requires centralized control and can be rather complex.
M. Kavehrad and M. Tabiani describe, in "Selective broadcast optical passive star coupler design for dense wavelength division multiplexer networks," IEEE Photonics Technology Letters, Vol. 3, No. 5, May 1991, pp. 487-489, reducing the splitting-loss power inefficiency by using a selective-broadcast optical star coupler to limit broadcasts to only a few nodes. The proposed device appears complicated to build and attempts to trade off splitting losses against the number of system wavelengths used. In one implementation shown, the number of system wavelengths equals the number of nodes. This is unattractive, because the total number of system wavelengths is likely to be technologically limited, which in turn limits the size of the system by limiting the number of nodes.
Darcie et al. describe, in U.S. Pat. No. 5,483,369, systems based on multiplexing RF signals on carrier frequencies up to a few GHz, or 10^9 Hz, using surface-acoustic-wave devices. Such does not translate well to the multiplexing of optical signals on carrier frequencies on the order of 10^15 Hz. The system Darcie proposes uses carrier frequencies that are so low that only a very few high-speed (GHz) channels can be multiplexed, since the desired channel modulation/transmission rate must always be significantly less than the total spectral extent.