Simple computer systems typically employ one or more static buses to couple together processors, memory, input/output (I/O) systems, and the like. However, more modern, high-performance computer systems often interconnect multiple processors, memory modules, I/O blocks, and so forth by way of multiple, reconfigurable, internal communication paths. For example, in the case of multiprocessing systems employing a single-instruction, multiple-data stream (SIMD) or multiple-instruction, multiple-data stream (MIMD) computer architecture, multiple processors may communicate simultaneously with other portions of the computer system for data storage and retrieval, thus requiring multiple communication paths between the processors and other parts of the system. One distinct advantage of such a system is that these paths typically provide redundancy so that a failure in one of these paths may be circumvented by the use of an alternate path through the system.
FIG. 1 provides a simplified block diagram of one possible computer system 100 employing multiple internal communication paths. A first set of endnodes 102 communicates with a second set of endnodes 104 by way of a set of switches 106. Each port 112 of the endnodes 102, 104 is coupled with a similar port 112 of one of the switches 106 by way of a communication link 108. Together, the switches 106 and the communication links 108 constitute a computer system interconnection “fabric” 101 through which the endnodes 102, 104 communicate with each other. In one particular example, each of the first set of endnodes 102 may be a processor, while each of the second set of endnodes 104 may include memory, I/O processors, and the like. In addition, some endnodes 102, 104 may communicate directly with each other without the aid of one of the switches 106 by way of point-to-point links 110. The endnodes 102, 104 and the switches 106 may be collectively identified as “nodes” of the computer system 100.
In the particular example of FIG. 1, each endnode 102, 104 is connected directly to each of the switches 106 so that several alternative communication paths exist between each of the first set of endnodes 102 and each of the second set of endnodes 104. The communication paths existing at any point in time through the interconnection fabric 101 are determined by the state of each of the switches 106. In one specific example, each of the switches 106 is a crossbar switch which connects each of its ports 112 connected with one of the first set of endnodes 102 with one of its ports 112 that is connected with one of the second set of endnodes 104. In alternative computer system configurations, the interconnection fabric may contain two or more levels of switches 106, such that each of the first set of endnodes 102 is connected with one of the second set of endnodes 104 by way of two or more switches 106. In another configuration, each of the first set of endnodes 102 may be coupled directly to each of the second set of endnodes 104 without the use of a switch 106. Innumerable other interconnection fabric configurations also exist.
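By way of illustration only, the notion that the state of a crossbar switch 106 determines the paths through the fabric 101 can be sketched as follows. The class name CrossbarSwitch, the port counts, and the restriction to a one-to-one pairing of ports are assumptions made for this sketch and are not taken from FIG. 1.

```python
class CrossbarSwitch:
    """Hypothetical model of one switch 106.

    Assumption: the switch has the same number of ports facing the first
    set of endnodes 102 as facing the second set of endnodes 104, and its
    state pairs each first-side port with a distinct second-side port.
    """

    def __init__(self, name, ports_per_side):
        self.name = name
        self.ports_per_side = ports_per_side
        self.state = {}  # first-side port index -> second-side port index

    def configure(self, state):
        """Install a new switch state after checking that it is one-to-one."""
        expected = list(range(self.ports_per_side))
        if sorted(state) != expected:
            raise ValueError("every first-side port needs exactly one entry")
        if sorted(state.values()) != expected:
            raise ValueError("every second-side port must be used exactly once")
        self.state = dict(state)

    def route(self, first_side_port):
        """Return the second-side port currently connected to the given port."""
        return self.state[first_side_port]


# Usage: three processors on ports 0-2, three memory endnodes on ports 0-2.
switch = CrossbarSwitch("switch_a", ports_per_side=3)
switch.configure({0: 2, 1: 0, 2: 1})
print("processor port 0 is connected to memory port", switch.route(0))
```

Reconfiguring the switch amounts to calling configure with a different mapping, which changes the communication paths that exist through that switch at that point in time.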
As can be seen in FIG. 1, the interconnection fabric 101 provides multiple potential communication paths to each of the first and second sets of endnodes 102, 104. The computer system 100 thus possesses the ability to circumvent failures in the system 100 in order to continue operating. More specifically, a failure in one of the endnodes 102, 104, switches 106, communication links 108, or communication ports 112 may be bypassed by way of an alternate path through the fabric 101. Of course, the throughput of at least a portion of the computer system 100 may be reduced, as less than the entirety of the interconnection fabric 101 is available to facilitate communication between the endnodes 102, 104 under such conditions.
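The bypassing of a failed component described above can be sketched, under simplifying assumptions, as a search for the single-switch paths that remain usable. The function names, the node labels, and the representation of a communication link 108 as an (endnode, switch) pair are hypothetical and serve only this example.

```python
def build_links(first_set, second_set, switches):
    """Connect every endnode directly to every switch, as in FIG. 1."""
    return {(endnode, switch)
            for endnode in list(first_set) + list(second_set)
            for switch in switches}


def usable_paths(src, dst, switches, links, failed=frozenset()):
    """List the (src, switch, dst) paths that avoid every failed component.

    `failed` may contain endnode names, switch names, or (endnode, switch)
    link pairs. A failed source or destination leaves no usable path.
    """
    if src in failed or dst in failed:
        return []
    paths = []
    for switch in switches:
        if switch in failed:
            continue
        src_link, dst_link = (src, switch), (dst, switch)
        if src_link not in links or dst_link not in links:
            continue
        if src_link in failed or dst_link in failed:
            continue
        paths.append((src, switch, dst))
    return paths


# Usage: two processors, two memory endnodes, and three switches (assumed counts).
processors = ["cpu0", "cpu1"]
memories = ["mem0", "mem1"]
switches = ["sw0", "sw1", "sw2"]
links = build_links(processors, memories, switches)

print(usable_paths("cpu0", "mem1", switches, links))
print(usable_paths("cpu0", "mem1", switches, links,
                   failed={"sw0", ("mem1", "sw1")}))
```

Each failure shortens the list of usable paths, which corresponds to the reduced throughput noted above: the endnodes can still communicate, but fewer routes through the fabric 101 carry the traffic.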
Oftentimes, however, a failure of a particular endnode 102, 104 affects more than one path through the interconnection fabric 101, thus causing a blockage for a number of endnodes 102, 104 attempting to communicate with each other. For example, if a particular endnode 104 is not accepting communications from another node of the computer system 100 due to an internal defect, then any switch 106 coupling that endnode 104 with other portions of the computer system 100 may be blocked from sending communications destined for the endnode 104 and other areas of the system 100. Consequently, any communications employing the particular switch 106 could be delayed or blocked as well. Progressing in this fashion, the resulting blockage could expand across major portions of the fabric 101, causing most, if not all, of the fabric 101 to be “gridlocked,” therefore disabling the entire computer system 100.
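The spreading blockage described above can be illustrated with a toy simulation. It assumes that each switch 106 holds pending communications in a single first-in, first-out buffer of fixed capacity, so that a communication stuck at the head of the buffer stalls everything behind it. That buffering model, the function name simulate, and all node labels and counts are assumptions made for this sketch, not details taken from the system of FIG. 1.

```python
from collections import deque


def simulate(sources, destinations, switches, dead, ticks=50, capacity=4):
    """Return (delivered_count, blocked_switches) after `ticks` time steps.

    Assumed model: one FIFO buffer per switch. On each tick a switch forwards
    the communication at the head of its buffer only if the destination is
    accepting traffic; each source then tries to inject one communication,
    cycling through every (switch, destination) combination.
    """
    buffers = {switch: deque() for switch in switches}
    delivered = 0
    counter = 0
    for _ in range(ticks):
        # Forwarding phase: a dead destination at the head stalls the buffer.
        for switch in switches:
            buffer = buffers[switch]
            if buffer and buffer[0] not in dead:
                buffer.popleft()
                delivered += 1
        # Injection phase: a full buffer refuses any further communication.
        for _source in sources:
            dst = destinations[counter % len(destinations)]
            switch = switches[(counter // len(destinations)) % len(switches)]
            counter += 1
            if len(buffers[switch]) < capacity:
                buffers[switch].append(dst)
    blocked = [s for s in switches if buffers[s] and buffers[s][0] in dead]
    return delivered, blocked


# Usage: three processors, three memory endnodes, two switches (assumed counts).
cpus = ["cpu0", "cpu1", "cpu2"]
mems = ["mem0", "mem1", "mem2"]
fabric = ["sw0", "sw1"]

print("healthy system:", simulate(cpus, mems, fabric, dead=set()))
print("mem1 not accepting:", simulate(cpus, mems, fabric, dead={"mem1"}))
```

In this model a single unresponsive endnode eventually stalls every switch that carries traffic toward it, and communications bound for healthy endnodes are stranded behind the stalled ones, which is the gridlock condition described above.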