§ 1.1 Field of the Invention
The present invention concerns ensuring that failover mechanisms in a device are not compromised to an unacceptable degree. More specifically, the present invention concerns monitoring failover mechanisms and, upon detecting an actual, imminent, or likely failure of a failover mechanism, reporting the failure.
§ 1.2 Related Art
The description of art in this section is not, and should not be interpreted to be, an admission that such art is prior art to the present invention.
Use of high-availability devices is especially important in real-time or mission-critical environments where outages can be devastating. For example, in telecommunications, “five nines” availability is often required, meaning that the data transport mechanism must be up and running 99.999% of the time. Network equipment in particular must be designed for high availability, and is therefore often built using failover elements.
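The “five nines” figure can be made concrete with a short calculation. The following is a minimal sketch, assuming a 365-day year; the function name and its parameters are illustrative and do not come from the specification.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: float = 365.0) -> float:
    """Return the maximum downtime, in minutes, permitted over the period.

    Assumption: availability is measured over a fixed period (default
    365 days) as the fraction of time the mechanism is up.
    """
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


five_nines = allowed_downtime_minutes(99.999)
print(f"{five_nines:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% availability leaves a little over five minutes of total downtime per year, which is why a single unprotected element failure can consume the entire budget.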
A high-availability device often includes failover elements that are in a standby mode. The failover elements are used when one or more primary elements in the device fail. A failover element may be identical to the primary element, thus providing full redundancy of the primary element, or may be an element that, although not identical to the primary element, serves to support the functions of the primary element when the primary element fully or partially fails. A single failover element may serve as the failover for one or more primary elements.
One example of network equipment requiring high availability is data forwarding devices, such as routers and switches. A basic function of these devices is to forward data received at their input lines to the appropriate output lines. Switches, for example, may be configured so that data received at input lines are provided to appropriate output lines. Switches are typically used in circuit-switched networks, in which a “connection” is maintained for the duration of a “call” (a communications session between two devices). If one or more elements in the switch fail and there is no failover element for the failed primary elements, the circuit may be broken and would need to be set up again.
In packet-switched networks, addressed data (referred to as “packets” in the specification below without loss of generality) are typically forwarded on a best-effort basis from a source to a destination. Many packet-switched networks are made up of interconnected nodes (referred to as “routers” in the specification below without loss of generality). The routers may be geographically distributed throughout a region and connected by links (e.g., optical fiber, copper cable, wireless transmission channels, etc.). In such a network, each router typically interfaces with multiple links.
Packets may traverse the network by being forwarded from router to router until they reach their destinations (specified by, for example, layer-3 addresses in the packet headers). Unlike switches, which establish a connection for the duration of a “call” or “session” to send data received on a given input port out on a given output port, routers determine the destination addresses of received packets and, based on these destination addresses, determine for each packet the appropriate link or links on which the packet should be sent. Routers may use protocols to discover the topology of the network, and algorithms to determine the most efficient ways to forward packets toward a particular destination. Packets sent from a source to a particular destination may be routed differently through the network, each taking a different path. Such packets can even arrive out of sequence.
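The per-packet forwarding decision described above can be sketched as a table lookup keyed on the destination address. The example below is only an illustration: it assumes a longest-prefix-match rule, which is one common way to choose among candidate routes but is not stated in the text, and the table contents and link names are hypothetical.

```python
import ipaddress

# Hypothetical forwarding table: destination prefix -> outgoing link name.
FORWARDING_TABLE = {
    ipaddress.ip_network("10.0.0.0/8"): "link-a",
    ipaddress.ip_network("10.1.0.0/16"): "link-b",
    ipaddress.ip_network("0.0.0.0/0"): "link-default",
}


def select_link(destination: str) -> str:
    """Pick the outgoing link for a packet from its layer-3 destination.

    Assumption: the most specific (longest) matching prefix wins.
    """
    addr = ipaddress.ip_address(destination)
    matches = [net for net in FORWARDING_TABLE if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return FORWARDING_TABLE[best]


print(select_link("10.1.2.3"))   # matches /8, /16 and /0; the /16 is chosen
print(select_link("192.0.2.1"))  # only the default route matches
```

Because each packet is looked up independently, a change in the table between two packets of the same flow can send them over different links, which is consistent with the out-of-sequence arrival noted above.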
Network outages may occur when elements (such as elements of network nodes, as well as links between such nodes) fail in the network. Failover elements may prevent an outage, but if both a primary element and its failover element fail, an outage may still occur. Failover mechanisms may become compromised in a variety of ways. For example, the failover element may have failed earlier and still be resident in the system (for example, if a primary element was not replaced after the system switched to the failover element), or it may have failed weeks or months earlier and been removed for replacement but not yet replaced. Consequently, what would otherwise be considered a robust and protected part of a system can become compromised to such a degree that it actually becomes a single point of failure. Such failures are avoidable, since they are often the result of a breakdown in operations practices and procedures. As the foregoing examples illustrate, it is often unknown how much failover protection is actually present in communications networks.
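The two compromise scenarios above (a failed element still resident, and an element removed but not yet replaced) can be illustrated with a small state model. This is a hedged sketch only: the `State` values, the `ProtectionGroup` type, and the function name are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    OPERATIONAL = "operational"
    FAILED = "failed"      # failed but still resident in the system
    REMOVED = "removed"    # pulled for replacement, not yet replaced


@dataclass
class ProtectionGroup:
    """Hypothetical pairing of a primary element with its failover element."""
    name: str
    primary: State
    failover: State


def is_single_point_of_failure(group: ProtectionGroup) -> bool:
    """True when exactly one element of the pair remains operational."""
    states = (group.primary, group.failover)
    return sum(s is State.OPERATIONAL for s in states) == 1


groups = [
    ProtectionGroup("power", State.OPERATIONAL, State.OPERATIONAL),
    ProtectionGroup("fabric", State.FAILED, State.OPERATIONAL),
    ProtectionGroup("control", State.OPERATIONAL, State.REMOVED),
]
exposed = [g.name for g in groups if is_single_point_of_failure(g)]
print(exposed)
```

In this model the "fabric" and "control" groups still carry traffic, so nothing appears wrong from the outside, yet each would cause an outage on its next failure.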
Accordingly, there is a need to ensure that failover mechanisms in devices are not compromised to an unacceptable degree.