Computer networks have taken on a role of central importance in today's highly electronic society. For example, computer networks provide an interactive environment in academic and entertainment settings, and provide workers, companies and business personnel with remote access to important information, facilitating telecommuting or remote management or collaboration. More significantly, computer networks are increasingly important to industrial concerns for the operation and management of manufacturing facilities. For example, many factories for production or assembly are highly computerized, and rely on sophisticated networks of computers and computing devices to carry out process or assembly steps.
As society becomes increasingly reliant on computer networks, the consequences of network failure loom larger. For example, network failures may be responsible for loss of communication between a manager and managed personnel or processes, or between cooperating devices on a single network. Especially in an industrial setting, such failures can be costly due to lost materials or productivity, and may also be dangerous to personnel or to critical facilities.
There currently exist a number of proposed solutions to the problem of network failure. The most effective of these solutions provide redundant information pathways, such as over redundant parallel networks. Thus, when one pathway becomes faulted, communications shift to a parallel network. However, such mechanisms do not necessarily provide continued connectivity between network entities when two or more such entities have network faults on different networks, effectively eliminating all network pathways of communication between them. The danger of such accumulated faults exists at some level in networks of all sizes, but increases with increased network size. In particular, traditional redundant network solutions have a reasonable chance of providing the necessary protections on very small networks, because the statistical probability of multiple faults at a single point in time is small. However, as the number of interconnected network entities increases, into the hundreds or even thousands of entities in some cases, the probability of simultaneous faults on different machines on different paths increases to unacceptable levels. Similarly, as the time required to repair one network fault increases, the probability of a simultaneous fault on another machine increases. For example, commercial switch-based recovery algorithms such as spanning tree may consume five seconds or more to fix a fault, rendering unattainable ideal recovery times of about one or two seconds.
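The scaling argument above can be illustrated with a minimal numerical sketch. It rests on assumptions that are not part of the original text: exactly two parallel networks, every entity connection faulting independently with the same probability p at any instant, and p approximated by the steady-state ratio of repair time to the sum of repair time and time between faults. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def connection_unavailability(mtbf_s: float, mttr_s: float) -> float:
    """Assumed steady-state probability that one connection is faulted.

    mtbf_s: mean time between faults, in seconds (illustrative input).
    mttr_s: mean time to repair a fault, in seconds (illustrative input).
    """
    return mttr_s / (mtbf_s + mttr_s)


def prob_cross_network_fault(n_entities: int, p: float) -> float:
    """Probability that at least one entity is faulted only on network A
    while at least one other entity is faulted only on network B.

    Such a pair of singly-faulted entities shares no working pathway in a
    simple dual-network scheme. Assumes independent faults with identical
    probability p per connection.
    """
    a = p * (1.0 - p)  # probability an entity is faulted on exactly one given network
    # Inclusion-exclusion over "no A-only entity" and "no B-only entity".
    return 1.0 - 2.0 * (1.0 - a) ** n_entities + (1.0 - 2.0 * a) ** n_entities


if __name__ == "__main__":
    # Hypothetical figures: one fault per connection per 30 days on average.
    mtbf = 30 * 24 * 3600.0
    for mttr in (1.0, 5.0, 3600.0):
        p = connection_unavailability(mtbf, mttr)
        for n in (10, 100, 1000):
            print(f"repair={mttr:>6.0f}s entities={n:>4d} "
                  f"P(cut-off pair)={prob_cross_network_fault(n, p):.3e}")
```

Under these assumptions the probability grows roughly with the square of both the number of entities and the repair time while it remains small, which is consistent with the observation that larger networks and slower recovery each make simultaneous faults on different paths more likely.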
The multitude and variety of failure scenarios presented by traditional redundant networks create a need for a more robust redundant network solution. Such a solution should accommodate many patterns of multiple simultaneous network faults without allowing complete failure of communications between singly-faulted network entities, while providing rapid recovery.