Many modern communication networks include routers that interconnect various paths in the network. Routers generally include tables which provide a map of routes through the network. The technology for routing messages through networks is well known. For example, see books such as “Designing Routing and Switching Architectures” (Network Architecture and Development Series), by Howard C. Berkowitz, published by Que, 1st edition, Nov. 15, 1999, ISBN: 1578700604, or “OSPF: Anatomy of an Internet Routing Protocol” by John T. Moy, published by Addison-Wesley Pub. Co., 1st edition, Jan. 15, 1998, ISBN: 0201634724.
Reliability is of primary importance in modern day communication systems. Reliability is often increased by the use of stand-by routers which are brought into operation when a primary router fails.
When a router becomes inoperable, a new map of the paths through the network must be calculated and propagated to all routers in the network. There are known protocols and techniques for doing this type of re-routing such as the “Link State Routing Protocol” or the “Distance-Vector Routing Protocol”. Using these protocols, routers talk to adjacent routers, informing each other of what network routes are currently active.
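As a rough illustration of how adjacent routers can inform each other of active routes, the following is a minimal sketch of a distance-vector style table update. It is not taken from any particular protocol implementation: the function name `merge_neighbor_vector`, the table layout (destination mapped to a cost and next hop), and the router names are all hypothetical, and real protocols add timers, loop-prevention measures, and fallback to alternate routes that are omitted here.

```python
import math


def merge_neighbor_vector(table, neighbor, link_cost, neighbor_vector):
    """Fold one neighbor's advertised costs into this router's table.

    table maps destination -> (cost, next_hop). Returns True if anything changed.
    All names here are illustrative assumptions.
    """
    changed = False
    for dest, advertised in neighbor_vector.items():
        candidate = link_cost + advertised
        current = table.get(dest)
        # Accept a route that is new or strictly cheaper than the known one.
        better = current is None or candidate < current[0]
        # A route already using this neighbor must track its new cost,
        # even when the cost rises (for example after a failure).
        stale = (
            current is not None
            and current[1] == neighbor
            and candidate != current[0]
        )
        if better or stale:
            table[dest] = (candidate, neighbor)
            changed = True
    return changed


# Hypothetical router with direct links to B (cost 1) and C (cost 5).
table = {"B": (1, "B"), "C": (5, "C")}

# B advertises cheaper reachability to C and a new destination D.
merge_neighbor_vector(table, "B", 1, {"C": 2, "D": 4})
after_update = dict(table)

# Later, B reports that C has become unreachable through it.
merge_neighbor_vector(table, "B", 1, {"C": math.inf, "D": 4})
```

In this sketch the second update marks the route to C through B as unreachable. A fuller implementation would then select an alternate route once other neighbors re-advertise their costs.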
Many different types of failures can occur in a system. However, one of the most common failures is a software failure. A software failure occurs when, for some reason, the software in a unit stops operating properly. In many systems, when a software failure occurs, the system branches to an exception handler routine. The exception handler routine is an independent program thread of execution that generally performs a number of operations that facilitate handling and post-mortem analysis. For example, the exception handler may perform a memory dump so that programmers can determine what caused a software failure.
Communication systems that include backup routers usually include a mechanism to detect software failures in the primary unit. When a software failure is detected by these mechanisms, operation of the backup unit is initiated. In currently available systems, there are a variety of different types of mechanisms for detecting failure and activating backup units.
Some systems include a hardware-implemented mechanism for detecting software failure and for activating a backup router. A hardware failure detection mechanism may, for example, include a special signal line that activates a standby unit when a software failure occurs in a primary unit.
For systems that do not include a hardware failure detection mechanism, there are several known types of failure detection mechanisms in widespread use. One known type of software failure detection uses a simple timeout mechanism. For example, a primary unit can be programmed to periodically send a signal to a standby unit (for example, every 1 to 30 seconds). If the standby unit does not receive this signal within a defined period, it concludes that the primary unit has failed, and the standby unit goes into operation. This type of failure detection is sometimes called a “heart-beat” method. Another known type of failure detection can be termed “hello-acknowledge”. When a “hello-acknowledge” methodology is used, the backup unit (or a central unit) periodically polls the primary unit. If a response is not received within a specified period, the system concludes that the primary unit is not operating.
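The “heart-beat” timeout described above can be sketched as follows. This is a minimal simulation under stated assumptions, not a description of any particular product: the class `HeartbeatMonitor`, the function `simulate`, the 5-second heartbeat interval, the 7.5-second timeout, and the 0.5-second polling tick are all hypothetical values chosen for illustration. Time is simulated in integer milliseconds so the example runs instantly and deterministically.

```python
class HeartbeatMonitor:
    """Standby-side timeout detector (hypothetical names and units)."""

    def __init__(self, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        self.last_seen_ms = now_ms

    def on_heartbeat(self, now_ms):
        # Record the arrival time of the latest signal from the primary.
        self.last_seen_ms = now_ms

    def primary_failed(self, now_ms):
        # The primary is presumed failed once the silence exceeds the timeout.
        return now_ms - self.last_seen_ms > self.timeout_ms


def simulate(fail_at_ms, interval_ms=5000, timeout_ms=7500,
             tick_ms=500, end_ms=60000):
    """Return the simulated time at which the standby detects the failure,
    or None if no failure is detected before end_ms."""
    monitor = HeartbeatMonitor(timeout_ms)
    for now_ms in range(0, end_ms + 1, tick_ms):
        # The primary sends a heartbeat every interval_ms while it is alive.
        if now_ms < fail_at_ms and now_ms % interval_ms == 0:
            monitor.on_heartbeat(now_ms)
        # The standby checks for silence on every tick.
        if monitor.primary_failed(now_ms):
            return now_ms
    return None


# Assumed scenario: the primary's software fails 12 seconds into the run.
detected_at_ms = simulate(fail_at_ms=12000)
detection_delay_ms = detected_at_ms - 12000
```

In this assumed scenario the last heartbeat arrives at 10 seconds, so the standby does not declare a failure until 18 seconds, which is 6 seconds after the failure actually occurred. A “hello-acknowledge” scheme could be sketched the same way, with the standby sending the poll and timing the reply.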
With the known types of software failure detection mechanisms, there can be a delay between when the failure occurs and when the backup unit detects the failure. For example, with “heart-beat” systems, there is a period of time between heart beats. While this period of time may be quite short (e.g., 1 to 30 seconds), in a communication system much data can be lost in a short period of time.
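To give a sense of scale, the amount of data exposed during the detection delay can be estimated as the link rate multiplied by the delay. The sketch below is a back-of-the-envelope upper bound only: the 1 Gb/s link rate is an assumed figure that does not come from the text above, the function name `data_at_risk_bytes` is hypothetical, and the estimate assumes the link is fully loaded and all traffic sent during the delay is lost.

```python
def data_at_risk_bytes(link_rate_bps, detection_delay_s):
    """Upper bound on bytes sent toward a failed unit before detection.

    Assumes a fully loaded link and that every bit sent during the
    delay is lost; both are simplifying assumptions.
    """
    return link_rate_bps * detection_delay_s // 8


# Assumed link rate for illustration only (1 Gb/s).
ASSUMED_LINK_RATE_BPS = 1_000_000_000

loss_at_1s = data_at_risk_bytes(ASSUMED_LINK_RATE_BPS, 1)
loss_at_30s = data_at_risk_bytes(ASSUMED_LINK_RATE_BPS, 30)
```

Under these assumptions, a 1-second delay corresponds to about 125 megabytes at risk, and a 30-second delay to about 3.75 gigabytes.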
The present invention is directed to a proactive software mechanism for detecting failure and for activating a backup unit.