The invention relates to a system and multi-thread method to manage a fault tolerant computer switching cluster using a spanning tree.
In the rapid development of computers many advancements have been seen in the areas of processor speed, throughput, communications, and fault tolerance. Initially computer systems were standalone devices in which a processor, memory and peripheral devices all communicated through a single bus. Later, in order to improve performance, several processors and were interconnected to memory and peripherals using one or more buses. In addition, separate computer systems were linked together through different communications mechanisms such as, shared memory, serial and parallel ports, local area networks (LAN) and wide area networks (WAN). However, these mechanisms have proven to be relatively slow and subject to interruptions and failures when a critical communications component fails.
One type of architecture of many that has been developed to improve throughput, allow for parallel processing, and to some extent, improve the robustness of a computer network is called a hypercube. Hypercube is a parallel processing architecture made up of binary multiples of computers (4, 8, 16, etc.). The computers are interconnected so that data travel is kept to a minimum. For example, in two eight-node cubes, each node in one cube would be connected to the counterpart node in the other. However, when larger numbers of processors and peripheral devices are included in the network, connecting each node, which includes processors and peripheral devices, to all other nodes is not possible. Therefore, routing tables for data must be established which indicate the shortest path to each node from any other node.
A hypercube like architecture, and many other types of networks and computer architectures, work well when all the components are operating properly. However, if a failure occurs to a node, switch, bus or communications line, then an alternate path for data will have to be determined and the routing or distance table would have to be computed again. If this failure occurs to a centrally located node, switch, or communications links, then the impact to the network would be more significant and in some configurations, possibly as much as half the network would not be able to communicate to the other half. Such a situation may require taking the network offline and reconfiguring the communications links as well as computing a new routing or distance table. Of course, taking a network offline or losing communications to a portion of a network is highly undesirable in a business, academic, government, military, or manufacturing environment due at least to the loss in productivity and possible even more dire consequences.
Therefore, what is needed is a system and method that will, upon initial set up of a computer network, determine the optimal routing of data for any configuration of a computer network having any number of processors, computers and peripherals, referred to as nodes, so as to create the shortest possible distances between nodes. Further, this system and method should, upon the detection of a switch or node failure, be able to identify a substitute link which has the least impact on the network and the routing or distance table used to transmit data. The system and method should also be able to switch to the substitute link with minimal impact to the operation of the network and without taking the entire network offline.