Redundancy is used in fault-tolerant systems to deal with failures. When a primary component (either hardware or software) fails, a back-up component takes over the responsibilities of the primary component.
An important place where redundancy may be used is the control plane of a communication system. The control plane is responsible for monitoring events in the network and the status of the network, and may be responsible for monitoring service level agreements. Typically, a network node has a primary control card implementing the node's control plane responsibilities. The primary control card receives requests and retrieves any system information necessary for responding to the requests. The system information may be stored locally in a database of the control card, or may be retrieved from other components of the communication system through line interface cards. For example, a control card could receive a request for usage statistics of a line card in order to monitor service level agreements. The control card may need to poll the line cards frequently in order to respond to frequent requests for statistics.
Such systems usually include a redundant control card for each primary control card in order to provide redundancy in the event of failure of the primary control card. The redundant control card must be able to assume the responsibilities of a failed primary control card almost instantly. The primary control card and the redundant control card should also be synchronized, in that they should have access to the same system information. Both of these issues are particularly important in the control plane of communication systems, in which large amounts of system information are changing quickly, and in which the responsibilities of a failed primary control card must be assumed by the redundant control card quickly in order to deal with high volumes of traffic and high quality of service expectations.
There are two common types of redundancy that attempt to achieve synchronicity and rapid assumption of responsibilities. The first type of redundancy is hot redundancy (also called 1:1 redundancy or lock-step redundancy). In systems employing hot redundancy, completely redundant hardware and software components are used. The redundant components are used solely as back-up components, in that they do nothing during normal operation of the system. The primary component is active, and the redundant component remains inactive in the sense that it has no effect on system operation. However, the redundant component operates exactly like the active component, making exactly the same computations and updating system information. Only when a primary component fails does a redundant component become active in the sense that it actually affects the system. Hardware circuits are typically used to ensure extremely fast activity switches. Synchronicity is achieved by use of point-to-point communication channels to ensure that the system information accessible by the primary component is the same as the system information accessible by the redundant component.
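The lock-step behaviour described above can be illustrated with a minimal sketch. It assumes a hypothetical ControlCard class that holds system information in a dictionary; the class names, method names, and keys are illustrative only and do not describe any particular product.

```python
class ControlCard:
    """Hypothetical control card that keeps system information in a dict."""

    def __init__(self, name):
        self.name = name
        self.system_info = {}

    def process(self, key, value):
        # Every card performs the same computation and the same update.
        self.system_info[key] = value
        return f"{self.name}: {key}={value}"


class HotRedundantPair:
    """Primary and standby run in lock-step; only the active card's output is used."""

    def __init__(self):
        self.primary = ControlCard("primary")
        self.standby = ControlCard("standby")
        self.active = self.primary
        self.primary_failed = False

    def handle(self, key, value):
        # While both cards are healthy, both see every event, so their
        # system information stays identical.
        if self.primary_failed:
            cards = [self.standby]
        else:
            cards = [self.primary, self.standby]
        outputs = {card.name: card.process(key, value) for card in cards}
        # Only the active card's result affects the system.
        return outputs[self.active.name]

    def fail_primary(self):
        # Activity switch: the standby already holds the same state,
        # so nothing needs to be copied at failure time.
        self.primary_failed = True
        self.active = self.standby
```

In this sketch the standby does the same work as the primary on every event, which is why its capacity is unavailable for anything else.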
In communication systems, hot redundancy is frequently used in voice communication systems due to the high reliability requirements. However, while hot redundancy ensures that the redundant component is always available in case of failure of the primary component, the inclusion of completely inactive redundant components is expensive. Components are doubled in number, without doubling capacity of the system in which the components are installed. The capacity of the redundant component is unused, other than to maintain synchronicity of system information with the primary component, until a failure occurs. This unused capacity is a potentially valuable resource in data communication systems.
The second type of redundancy is load sharing. In systems employing load sharing, the secondary component is used to some extent in normal operation. The use of an otherwise inactive component increases the efficiency of the system. When a failure occurs in the primary component, the redundant component takes on the added responsibility of the primary component, in addition to the tasks already being processed by the redundant component.
However, the secondary component does not stay in lock-step with the primary component, so when the primary component of a load sharing system fails, it takes longer for the redundant component to assume all of the responsibilities of the primary component. Also, such systems typically have certain single-point-of-failure scenarios that the redundancy scheme cannot handle. The occurrence of such a failure can be catastrophic.
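For comparison with hot redundancy, load sharing can be sketched as follows. The LoadSharingPair class, its least-loaded assignment policy, and the task names are assumptions made for illustration; real systems may divide work quite differently.

```python
class LoadSharingPair:
    """Two hypothetical components that split tasks during normal operation."""

    def __init__(self):
        self.tasks = {"primary": [], "secondary": []}
        self.alive = {"primary": True, "secondary": True}

    def assign(self, task):
        # Give the task to the least loaded component that is still alive
        # (an assumed policy; ties go to the first component listed).
        candidates = [name for name in self.tasks if self.alive[name]]
        target = min(candidates, key=lambda name: len(self.tasks[name]))
        self.tasks[target].append(task)
        return target

    def fail(self, name):
        # The survivor takes on the failed component's tasks in addition
        # to the tasks it is already processing.
        self.alive[name] = False
        survivors = [n for n in self.tasks if self.alive[n]]
        if not survivors:
            raise RuntimeError("no component left to assume responsibilities")
        orphaned, self.tasks[name] = self.tasks[name], []
        self.tasks[survivors[0]].extend(orphaned)
        return len(orphaned)
```

The hand-over in fail() happens only after the failure, which is the step that makes takeover slower than in a lock-step arrangement.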
In data communications, rather than having redundancies within each node, the emphasis has been on redundant nodes within the network and the ability to re-route around a failure.
Redundancy is used to ensure that routing information can be provided upon demand. Two responsibilities of a router employing the Open Shortest Path First (OSPF) protocol are to maintain a link state database describing a topology of the communication network, and to provide routes upon request using the stored topology of the communication network.
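To make the second responsibility concrete, the following is a minimal sketch of a shortest path calculation over a stored topology, using Dijkstra's algorithm, on which OSPF route calculation is based. The representation of the link state database as a dictionary of per-router link costs, and the function name shortest_path, are simplifying assumptions rather than the structures an actual OSPF implementation uses.

```python
import heapq


def shortest_path(link_state_db, source, destination):
    """Dijkstra's algorithm over a dict of the form {router: {neighbour: cost}}.

    Returns (total_cost, [routers on the path]) or None if unreachable.
    """
    best = {source: 0}
    previous = {}
    queue = [(0, source)]
    while queue:
        cost, router = heapq.heappop(queue)
        if router == destination:
            break
        if cost > best.get(router, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbour, link_cost in link_state_db.get(router, {}).items():
            new_cost = cost + link_cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = router
                heapq.heappush(queue, (new_cost, neighbour))
    if destination not in best:
        return None
    # Walk back from the destination to rebuild the path.
    path = [destination]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[destination], path[::-1]
```

Two modules holding the same topology return the same result from such a function, which is why the stored topology is the thing that must be kept consistent.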
OSPF routers exchange link state information in the form of Link State Advertisements (LSAs). A router floods the communication network with LSAs when the router first comes online, and typically periodically thereafter. An OSPF router also transmits an LSA if it detects a change in the network topology, for example if a neighbouring router goes down. Each OSPF router maintains a state machine for each neighbouring router. If the state of a neighbouring router is “Full”, then the router on which the OSPF module resides is in full communication with the neighbouring router. If the OSPF router does not receive a “Hello” packet from a neighbouring router before the expiry of a timer, then the state of the neighbouring router is set to “Down”. The state of the neighbouring router then progresses through various states until a proper exchange of protocol packets is completed, at which time the state of the neighbouring router is set to “Full”. While the state of a given neighbouring router is not “Full”, the OSPF router does not attempt to calculate routes through the neighbouring router.
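The per-neighbour state machine can be sketched in simplified form. The intermediate state names follow the OSPF specification, but the sketch collapses each protocol exchange into a single advance() call, and the Neighbour class, its method names, and the 40 second dead interval are assumptions chosen for illustration.

```python
# Simplified progression of neighbour states (the "Attempt" state is omitted).
STATES = ["Down", "Init", "2-Way", "ExStart", "Exchange", "Loading", "Full"]


class Neighbour:
    """Hypothetical, simplified state machine kept for one neighbouring router."""

    def __init__(self, dead_interval=40.0):
        self.state = "Down"
        self.dead_interval = dead_interval  # assumed timer value, in seconds
        self.last_hello = None

    def hello_received(self, now):
        # A Hello restarts the timer and brings a Down neighbour to Init.
        self.last_hello = now
        if self.state == "Down":
            self.state = "Init"

    def advance(self):
        # Stands in for one successful step of the protocol packet exchange.
        index = STATES.index(self.state)
        if 0 < index < len(STATES) - 1:
            self.state = STATES[index + 1]

    def tick(self, now):
        # No Hello before the timer expires: the neighbour is set to Down.
        if self.last_hello is None or now - self.last_hello >= self.dead_interval:
            self.state = "Down"

    def usable_for_routes(self):
        # Routes are calculated only through neighbours in the Full state.
        return self.state == "Full"
```

A standby module that does not share this per-neighbour state would have to rebuild it from "Down" after an activity switch, during which time no routes would be calculated through the affected neighbours.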
Any redundancy system within an OSPF router should ensure that an active OSPF module is synchronized with a standby OSPF module so that each OSPF module is capable of calculating routes using the same stored network topology, and will therefore calculate the same shortest path when requested. The link state database of each OSPF module would therefore have to be synchronized properly. Lack of synchronicity could arise due to delay in processing or copying LSA information from one OSPF module to another. However, general redundancy schemes involve byte-wise copying of redundancy information from an active control card to a standby control card. This could create problems in an OSPF router, since the standby OSPF module could contain meaningless (or at best, confusing) data if asked to calculate a route part way through copying of an LSA.
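The hazard of byte-wise copying can be illustrated by contrasting it with a standby that applies only complete LSAs. The sketch below is one possible illustration and not a method taken from this document: the length-prefixed framing, the "id=body" record layout, and the names StandbyLinkStateDb and encode_lsa are all assumptions.

```python
import struct


def encode_lsa(lsa_id, body):
    """Frame a hypothetical LSA record as a 4-byte length followed by 'id=body'."""
    record = f"{lsa_id}={body}".encode()
    return struct.pack("!I", len(record)) + record


class StandbyLinkStateDb:
    """Standby database that exposes an LSA only once it has fully arrived."""

    def __init__(self):
        self._buffer = b""
        self.lsas = {}

    def receive(self, chunk):
        # Bytes may arrive in arbitrary pieces; hold them outside the database.
        self._buffer += chunk
        while len(self._buffer) >= 4:
            (length,) = struct.unpack("!I", self._buffer[:4])
            if len(self._buffer) < 4 + length:
                break  # LSA still incomplete, leave the database untouched
            record = self._buffer[4:4 + length]
            self._buffer = self._buffer[4 + length:]
            lsa_id, _, body = record.partition(b"=")
            # The whole LSA is applied in a single step.
            self.lsas[lsa_id.decode()] = body.decode()
```

If a route were requested between the two halves of a transfer, this standby would calculate it from the last complete topology rather than from a partially copied LSA.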
Additionally, the wasted capacity of hot redundancy is a particular problem in routers, since calculation of a shortest path is computationally very expensive.