The present invention is directed to networks or clusters of computing apparatuses, and especially to particular aspects of operation of such computer networks, such as recovery from a fault suffered by a member computer of the network. In this context, recovery includes continuity of full-service operation (or as near to full service as can be attained under the circumstances) of a network of computing apparatuses while repairs are effected, and subsequent restoration of a repaired computing apparatus to participation in service provision for the network. The present invention is useful in operating computer networks in any environment or in any technical application, and is particularly useful in operating computer networks associated with wireless telecommunication systems. One skilled in the art of computer network design will recognize that no aspect of the invention limits its employment to telecommunication applications.
Grouping computers into clusters or networks is one method for providing higher levels of availability for computers. Such a connection arrangement provides a structure by which failure of any one of the computers in the network can be compensated for by one or more of the remaining member computers in the network. Management and control of such a cluster of computing apparatuses is carried out by an entity known as a “watchdog”. A watchdog can be implemented in hardware or in a set of collaborating software processes or programs. In a software implementation, one of the software entities among the set is preferably the primary controller of recovery activities, and the remaining software entities are secondary or backup controllers configured to assume control of watchdog operation in the event the primary software entity fails.
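The primary and backup arrangement of software watchdog entities described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular watchdog: the `WatchdogReplica` type, the `active_controller` function, the replica names, and the rule that the first live replica in a fixed priority order takes control are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WatchdogReplica:
    """One software watchdog process (hypothetical model)."""
    name: str
    alive: bool = True


def active_controller(replicas: List[WatchdogReplica]) -> Optional[WatchdogReplica]:
    """Return the first live replica in priority order, or None if all have failed.

    Assumption: the list order encodes priority, with the primary first.
    """
    for replica in replicas:
        if replica.alive:
            return replica
    return None


replicas = [
    WatchdogReplica("primary"),
    WatchdogReplica("backup-1"),
    WatchdogReplica("backup-2"),
]

first = active_controller(replicas).name   # the primary controls while it is alive
replicas[0].alive = False                  # simulate failure of the primary
second = active_controller(replicas).name  # a backup assumes control
print(first, second)
```

A real set of collaborating processes would also need a way to detect that the primary has failed (for example, missed heartbeats) and to agree on the successor; both are outside the scope of this sketch.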
A watchdog implemented in hardware is typically embodied in a single unit having an ultra-high Mean Time Between Failures (MTBF), that is, a very low failure rate. The unit is usually an autonomous unit in a separate locus in the system, but could as well be included in the cabinetry of one or more of the computing apparatuses in the network. The point is that the hardware watchdog is substantially implemented in hardware and is therefore more robust and reliable than a software implementation. Some software is included to operate the hardware, but sensing inoperative computing apparatuses, switching operations among computing apparatuses, and other activities vital to recovery operations (i.e., continuity and restoration) over which the hardware watchdog has control are implemented and executed in hardware. A watchdog that is substantially implemented in hardware is also typically much faster than a software watchdog entity. One problem inherent in hardware watchdog entities is that any failure leaves the cluster without centralized control for recovery operations until the failed hardware watchdog unit can be repaired or replaced. While the hardware watchdog is being repaired or replaced there is a window of vulnerability during which the network has no way to recover from a failure of any other computer in the cluster or network. In such circumstances any service provided by the cluster may be severely and detrimentally affected. One may provide multiple hardware watchdog units to overcome this vulnerability, but that is a complex and costly solution.
Another solution is to provide watchdog protection using a high priority software process having replicas distributed across the cluster of computers. Such an arrangement avoids the catastrophic failure risked with a single hardware watchdog setup, and it avoids most of the complexity and expense of providing additional hardware watchdog units. However, the overall availability of the cluster is lower with a software watchdog implementation than with a hardware watchdog unit because software entities commonly exhibit higher failure rates than hardware implementations.
The inventors have developed a two-tier watchdog apparatus and method for effecting recovery of a network of computing apparatuses. According to the preferred embodiment of the invention, a hardware watchdog entity provides primary control of continuity operations (i.e., shifting services from an inoperative computer to operative computers) and recovery operations (i.e., returning services to a computer after it is restored to operation following a failure). In the event that the hardware watchdog unit fails, a set of software watchdog entities assumes control of continuity and recovery operations. The two-tier protection provided by the apparatus and method of the present invention is significantly more reliable, and provides a more robust computer clustering system, than reliance solely upon a hardware watchdog system or solely upon a software watchdog system.
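The two-tier selection of a controlling watchdog, and the continuity operation that the controller performs, can be illustrated with a short sketch. It is offered only as an aid to understanding under stated assumptions: the function names `select_controller` and `shift_services`, the node and service names, and the round-robin placement of displaced services are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Set


def select_controller(hardware_ok: bool, software_replicas_ok: List[bool]) -> str:
    """Tier 1 is the hardware watchdog; tier 2 is the set of software watchdogs."""
    if hardware_ok:
        return "hardware"
    if any(software_replicas_ok):
        return "software"
    return "none"


def shift_services(
    assignments: Dict[str, List[str]], failed: str, operative: Set[str]
) -> Dict[str, List[str]]:
    """Continuity operation: move services off a failed node onto operative nodes.

    Assumption: displaced services are spread round-robin over the operative
    nodes in sorted order; the source does not specify a placement policy.
    """
    targets = sorted(operative - {failed})
    if not targets:
        raise RuntimeError("no operative node remains to receive services")
    result = {node: list(services) for node, services in assignments.items()}
    displaced = result.get(failed, [])
    result[failed] = []
    for index, service in enumerate(displaced):
        result.setdefault(targets[index % len(targets)], []).append(service)
    return result


assignments = {"node-a": ["billing"], "node-b": ["paging", "routing"], "node-c": []}

# Hardware watchdog has failed, so the software tier takes control.
controller = select_controller(hardware_ok=False, software_replicas_ok=[True, True])

# The controlling watchdog then shifts services away from the failed node.
after = shift_services(assignments, failed="node-b", operative={"node-a", "node-c"})
print(controller, after)
```

The restoration step (returning services to a repaired node) would be the inverse of `shift_services` and is omitted here for brevity.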
Typical computer clustering systems can only protect applications from a single point of failure. That is, a hardware watchdog can fail or a computing apparatus in the cluster (i.e., a network processing node) can fail, but if both the hardware watchdog and a network node fail, service will be adversely affected. If a hardware watchdog system fails, there can be no recovery from subsequent node failures until the hardware watchdog unit is replaced or otherwise rendered operational. With the two-tier watchdog apparatus and method of the present invention, a hardware watchdog can fail and (n−1) processing nodes in a network may fail (where n is the number of nodes in the network) and limited service can still be provided using the remaining operational nodes. The capability to provide at least some level of service down to the “last node standing” extends failure coverage and increases overall availability and reliability of a cluster to a significant degree.
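The difference in failure coverage described above can be illustrated with a small simulation. It is a simplified model under stated assumptions, not a description of the claimed apparatus: the function `run_failures`, the node and service names, and the policy of moving displaced services to the first remaining operative node are hypothetical, and node capacity (the reason the surviving service is "limited") is not modeled.

```python
from typing import Dict, List


def run_failures(
    nodes: List[str],
    services: Dict[str, str],
    failures: List[str],
    hardware_ok: bool,
    software_fallback: bool,
) -> Dict[str, str]:
    """Apply node failures in order; return the service -> node map still running.

    A service on a failed node survives only if some watchdog tier is in
    control and at least one operative node remains to receive it.
    """
    operative = list(nodes)
    placement = dict(services)
    controller_available = hardware_ok or software_fallback
    for failed in failures:
        operative.remove(failed)
        for service, node in list(placement.items()):
            if node != failed:
                continue
            if controller_available and operative:
                placement[service] = operative[0]  # continuity: shift the service
            else:
                del placement[service]             # no controller: service is lost
    return placement


nodes = ["n1", "n2", "n3", "n4"]
services = {"s1": "n1", "s2": "n2", "s3": "n3", "s4": "n4"}
failures = ["n1", "n2", "n3"]  # (n - 1) of the n = 4 nodes fail

# Hardware watchdog already failed, software tier present (two-tier system).
two_tier = run_failures(nodes, services, failures, hardware_ok=False, software_fallback=True)

# Hardware watchdog already failed, no software tier (hardware-only system).
hardware_only = run_failures(nodes, services, failures, hardware_ok=False, software_fallback=False)

print(two_tier)
print(hardware_only)
```

In this model the two-tier system ends with every service running on the last node standing, while the hardware-only system retains only the service that was already hosted on the surviving node.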
There is a need for a watchdog recovery control system for computer cluster networks that is improved in its flexibility over prior art watchdog control systems.
There is a need for a watchdog recovery control system for computer cluster networks that is improved in its robustness over prior art watchdog control systems.
There is a need for a watchdog recovery control system for computer cluster networks that is improved in its reliability over prior art watchdog control systems.