The continued demand for high-performance computers and computer systems requires optimum use of the available hardware and software. One approach is to build systems from processing nodes, each comprising one or more microprocessors and memories. These computer systems are sometimes referred to as shared multiprocessor systems. In a shared multiprocessing computer system, the nodes are interconnected so that they can communicate with each other and share operating systems, resources, data, memory, etc.
One goal of building a modern computing machine for use at the enterprise level is to have enough system capacity to take the many different workloads and applications running in a distributed computing environment, such as a server farm, and move them onto a large, highly available monolithic host server that remains operable and available while maintenance or a capacity upgrade is being performed on it.
The motivation for consolidating workloads and applications from many small machines onto a single larger one is financial: it reduces the number of system operators, the amount of floor space, and system maintenance costs. The risk associated with such a consolidation, however, is that an unplanned system outage could shut down the entire processing center. Up until now, system integration vendors have focused mainly on pushing the symmetric multiprocessor (SMP) computer system size envelope, integrating up to 64 or more processors in a tightly coupled shared memory system in a variety of coherent inter-processor connect topologies. The commonly available designs on the Unix platform include topologies in which integrated processor-memory nodes, or simply nodes, are interconnected by means of multiple parallel common directional loops with a distributed switch network (topology A), a central crossbar switch (topology B), or a tree-based hierarchical switch (topology C). All of the above-mentioned topologies can be built to achieve the large scalability goal of a modern computing machine, but they do not completely achieve the system availability goal of dynamically replacing or repairing a hardware component while the rest of the system continues to operate.
Should a failing hardware component need to be replaced on any of the three topologies mentioned above, the system would be either severely degraded or rendered inoperable. For example, a failing node anywhere on topology A would prevent inter-processor communication from being maintained among the remaining nodes. On topology B, shutting down and replacing the central crossbar switch would essentially bring down all inter-processor communication between nodes. Even if a parallel central crossbar switch were added as a fail-over network, it would be impractical to package the crossbar switches anywhere except on the backplane into which the nodes are plugged; shutting down and replacing a switch component would therefore require replacing the backplane, which would result in the loss of all node-to-node communication. If the failing node on topology C is at the apex of the tree network, inter-processor communication would not be possible between the branches beneath the apex node. A fourth possible topology would be a fully connected star-based scheme with all nodes connecting directly to each other. While this topology would be ideal for dynamic node replacement, its main drawbacks are the limited bus bandwidth on configurations with a small number of nodes and the long wiring nets needed for a fully connected topology, which are high-latency, frequency-limited, and susceptible to noise.
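The failure behavior described above can be illustrated with a small connectivity check. The following Python sketch is not taken from any particular system; it rests on simplifying assumptions: topology A is modeled as a single unidirectional ring, topology B as nodes attached to one switch, topology C as a binary tree whose vertices are the nodes themselves, and the fourth topology as a full mesh. All function names (`directed_ring`, `crossbar`, `binary_tree`, `full_mesh`, `survives`) are hypothetical and chosen only for illustration.

```python
from collections import deque


def directed_ring(n):
    """Assumed model of topology A: node i forwards only to node i+1."""
    return {i: [(i + 1) % n] for i in range(n)}


def crossbar(n):
    """Assumed model of topology B: every node talks through one switch."""
    adj = {i: ["switch"] for i in range(n)}
    adj["switch"] = list(range(n))
    return adj


def binary_tree(n):
    """Assumed model of topology C: node 0 is the apex of a binary tree."""
    adj = {i: [] for i in range(n)}
    for child in range(1, n):
        parent = (child - 1) // 2
        adj[parent].append(child)
        adj[child].append(parent)
    return adj


def full_mesh(n):
    """Fourth topology: every node is wired directly to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def reachable(adj, start, removed):
    """Breadth-first search from start that never enters the removed component."""
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in adj.get(cur, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def survives(adj, nodes, removed):
    """True if every remaining node can still reach every other remaining node."""
    remaining = [n for n in nodes if n != removed]
    return all(set(remaining) <= reachable(adj, n, removed) for n in remaining)


if __name__ == "__main__":
    nodes = range(4)
    print("ring, node 2 removed:     ", survives(directed_ring(4), nodes, 2))
    print("crossbar, switch removed: ", survives(crossbar(4), nodes, "switch"))
    print("tree, apex removed:       ", survives(binary_tree(4), nodes, 0))
    print("full mesh, node 2 removed:", survives(full_mesh(4), nodes, 2))
```

Under these simplified models, removing one ring node, the crossbar switch, or the tree apex leaves the remaining nodes unable to reach one another, whereas removing any single node from the full mesh does not. The sketch checks connectivity only; it does not capture the bandwidth and wiring drawbacks of the fully connected scheme.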
Accordingly, it is desirable to have a shared multiprocessing computer system in which critical hardware components can be swapped out for maintenance while the remaining components continue to operate, thereby virtually eliminating system downtime due to planned outages for hardware servicing.