In recent years the Internet has been transformed from a special-purpose network to a ubiquitous platform for a wide range of everyday communication services. The demands on Internet reliability and availability have increased accordingly. A disruption of a link in central parts of a network has the potential to affect hundreds of thousands of phone conversations or TCP connections, with obvious adverse effects.
The ability to recover from failures has always been a central design goal in the Internet. See D. D. Clark, “The Design Philosophy of the DARPA Internet Protocols,” SIGCOMM, Computer Communications Review, vol. 18, no. 4, pp. 106-114, August 1988. IP networks are intrinsically robust, since IGP routing protocols like OSPF are designed to update the forwarding information based on the changed topology after a failure. This re-convergence assumes full distribution of the new link state to all routers in the network domain. When the new state information is distributed, each router individually calculates new valid routing tables.
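The per-router recalculation described above can be illustrated with a minimal sketch. It assumes a hypothetical four-router topology with undirected, positively weighted links, and a router that rebuilds its next-hop table with Dijkstra's algorithm once the updated link state has reached it. The function name `routing_table` and the topology are illustrative assumptions, not part of OSPF or of any particular implementation.

```python
import heapq

def routing_table(links, source):
    """Map each reachable destination to the next hop from `source`.

    `links` maps an undirected link (u, v) to a positive cost.
    """
    adjacency = {}
    for (u, v), cost in links.items():
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))

    best = {source: 0}
    table = {}
    done = set()
    heap = [(0, source, source)]  # (distance, node, first hop on the path)
    while heap:
        dist, node, hop = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node != source:
            table[node] = hop
        for neighbor, cost in adjacency.get(node, []):
            if neighbor in done:
                continue
            new_dist = dist + cost
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                # A direct neighbor of the source is its own first hop.
                first = neighbor if node == source else hop
                heapq.heappush(heap, (new_dist, neighbor, first))
    return table

# Hypothetical topology: A-B-C is the short path, A-D-C the backup.
links = {("A", "B"): 1, ("B", "C"): 1, ("A", "D"): 1, ("D", "C"): 2}
before = routing_table(links, "A")

# Link B-C fails; after the new link state is flooded, A recomputes.
failed = {link: cost for link, cost in links.items() if link != ("B", "C")}
after = routing_table(failed, "A")
```

In this sketch router A forwards traffic for C via B before the failure and via D afterwards. The recomputation is only valid once every router has received the new link state, which is why the process is network-wide.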
This network-wide IP re-convergence is a time-consuming process, and a link or node failure is typically followed by a period of routing instability. During this period, packets may be dropped due to invalid routes. This phenomenon has been studied in both the IGP context (A. Basu and J. G. Riecke, “Stability Issues in OSPF Routing,” in Proceedings of SIGCOMM 2001, August 2001, pp. 225-236) and the BGP context (C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian, “Delayed Internet Routing Convergence,” IEEE/ACM Transactions on Networking, vol. 9, no. 3, pp. 293-306, June 2001) and has an adverse effect on real-time applications (C. Boutremans, G. Iannaccone, and C. Diot, “Impact of link failures on VoIP performance,” in Proceedings of International Workshop on Network and Operating System Support for Digital Audio and Video, 2002). Events leading to a re-convergence have been shown to occur frequently, and are often triggered by external routing protocols (D. Watson, F. Jahanian, and C. Labovitz, “Experiences with monitoring OSPF on a regional service provider network,” in ICDCS '03: Proceedings of the 23rd International Conference on Distributed Computing Systems. IEEE Computer Society, 2003, pp. 204-213).
Much effort has been devoted to optimizing the different steps of the convergence of IP routing, i.e., detection, dissemination of information, and shortest path calculation, but the convergence time is still too long for applications with real-time demands. A key problem is that, since most network failures are short-lived (A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.-N. Chuah, and C. Diot, “Characterization of failures in an IP backbone network,” in Proceedings of INFOCOM 2004, March 2004), triggering the re-convergence process too rapidly can cause route flapping and increased network instability.
The IGP convergence process is slow because it is reactive and global. It reacts to a failure after it has happened, and it involves all the routers in the domain.
In “FROOTS—Fault Handling in Up*/Down* Routed Networks with Multiple Roots”, by Ingebjørg Theiss and Olav Lysne, one of the co-inventors of the present application, Proceedings of the International Conference on High Performance Computing HiPC 2003, Springer-Verlag, the contents of which are incorporated herein by way of reference, there is disclosed an improved system for handling faults in networks. A number of virtual configuration layers are created, and each node is made safe in one of these layers, i.e., by being made a leaf node in that layer so that data is not directed through that node to any other node. When a node fault is registered by the source node, the source node can choose a safe layer for the faulty node, and then transmit data according to the new configuration defined in that layer. If there is a faulty link, there will be two nodes attached to the faulty link, and a safe layer is chosen for one of these nodes, on an arbitrary basis. Using the system described, it is possible to have a relatively small number of layers; for example, it is shown that for a network with 16 k nodes and 64 k links it is possible to obtain coverage using a maximum of 8 layers.
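The layer selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the FROOTS algorithm itself: each layer is modeled as a hypothetical list of routing-tree links, a node counts as safe in a layer when it has exactly one link there (a leaf), and the helper names `is_leaf`, `assign_safe_layers` and `select_layer` are invented for this example.

```python
def is_leaf(layer_links, node):
    """A node is a leaf in a layer if exactly one of the layer's links touches it."""
    degree = sum(node in link for link in layer_links)
    return degree == 1

def assign_safe_layers(layers, nodes):
    """Map each node to the first layer in which it is a leaf.

    Nodes that are a leaf in no layer get no entry (not covered).
    """
    safe = {}
    for node in nodes:
        for index, layer_links in enumerate(layers):
            if is_leaf(layer_links, node):
                safe[node] = index
                break
    return safe

def select_layer(safe, fault):
    """Return the layer to switch to for a failed node or a failed link (u, v)."""
    if isinstance(fault, tuple):
        # Link fault: pick one endpoint arbitrarily (here simply the first).
        node = fault[0]
    else:
        node = fault
    return safe[node]

# Hypothetical four-node network with two layers (star trees rooted at a and b).
layers = [
    [("a", "b"), ("a", "c"), ("a", "d")],  # layer 0: b, c, d are leaves
    [("b", "a"), ("b", "c"), ("b", "d")],  # layer 1: a, c, d are leaves
]
safe = assign_safe_layers(layers, ["a", "b", "c", "d"])

layer_for_node_fault = select_layer(safe, "a")
layer_for_link_fault = select_layer(safe, ("a", "b"))
```

Here node a is safe only in layer 1, so a fault in node a, or in link a-b when endpoint a is the one chosen, leads the source to route in layer 1, where no traffic transits node a.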
Such an arrangement has advantages over previous systems, but in practice there may be delays in switching to an alternative layer when there is a fault in a node or a link, and there are some limitations as to the versatility of the system.