Fault-tolerant and high availability processing systems are known. These systems are used in applications requiring high reliability and extremely low downtime. Exemplary applications for fault-tolerant or high availability systems include telecommunications applications, such as switching technology used in wire line and wireless telephone switching applications.
Computer-based distributed client-server type systems that are consistent with the ITU network model are being deployed in high-availability processing systems, such as telecommunications applications. Network elements, such as base stations in a wireless application, manage the access network resources, such as radios, channels, etc. Application processors (APs) make requests of the network elements in order to fulfill their functions. Exemplary APs perform functions such as call processing, signaling, data traffic processing or the like.
Computer-based high availability systems typically require a relatively large amount of space or real estate. It is desirable to reduce the space required for high availability systems. Cost must also be reduced. These constraints are pushing telecommunications service providers to distribute applications across diskless commercial processors. Diskless commercial processors provide the price and performance needed for high availability systems such as telecommunications switching systems, but present some reliability challenges.
One exemplary commercial high availability system includes a network interconnecting cluster groups of processors. Each cluster group has processors, including at least one boot processor and at least one satellite CPU. Typically the satellite CPUs are diskless. The boot processor includes a disk. The processors in the cluster groups run an operating system, such as UNIX, with a Network File System (“NFS”) feature. The NFS feature permits the processors in the same NFS group to seamlessly share disk storage. The diskless processors are booted with NFS, even though the processors may not have a disk directly attached to the processor. Each cluster group, which, in the case of NFS, is called an NFS group, typically includes a power system, cooling system, housing, and other common support functions. The common support functions reduce cost by spreading overhead among multiple processors. However, the common support functions are a single point of failure, which, in large configurations (i.e., those with many processors), creates undesirably large failure groups. High availability common support functions, such as N+K sparing of power supplies, fans, etc., increase availability, but also increase cost.
Software-based application processors are arranged to take advantage of the N+K processing power. Within a single cluster group running the NFS feature, applications run multiple software instances on one or more clients. A failure in a client is not fatal. However, a failure in the common support functions or boot processor of the single cluster group is fatal. Cluster group networking, in which two or more cluster groups (i.e., NFS groups) are connected over a network so that an application spans multiple NFS groups, is used to prevent a single failure in a cluster group from being fatal by providing at least one backup cluster group, i.e., a different NFS group.
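By way of illustration only, the following Python sketch shows why spanning multiple NFS groups keeps a single cluster group failure from being fatal. The function names, instance names, and group names are hypothetical and are not taken from any particular system; the sketch assumes only that each software instance is hosted in exactly one cluster group.

```python
from typing import Dict, List


def surviving_instances(placement: Dict[str, str], failed_group: str) -> List[str]:
    """Return the instances that are not hosted in the failed cluster group."""
    return [inst for inst, group in placement.items() if group != failed_group]


def single_group_failure_is_fatal(placement: Dict[str, str]) -> bool:
    """True if the loss of any one cluster group removes every instance."""
    groups = set(placement.values())
    return any(not surviving_instances(placement, g) for g in groups)


# Hypothetical placements: instance name -> cluster (NFS) group name.
single_group = {"call-proc-1": "nfs-group-A", "call-proc-2": "nfs-group-A"}
spanned = {"call-proc-1": "nfs-group-A", "call-proc-2": "nfs-group-B"}
```

With the `single_group` placement, loss of the common support functions or boot processor of `nfs-group-A` leaves no instance running; with the `spanned` placement, at least one instance survives the loss of either group.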
In cluster group networking, the network should not be a single point of failure. Therefore, multiple access points to the network and independent network connections should be maintained. Even where there are two or more physical network access points for a single processor, for example, multiple network cards and network mediums, some network software requires that a single software stack be maintained on the processor. TCP/IP, a de facto standard in network software for IP-based systems, is a network software application that permits only one software stack per processor. The single stack is a potential single point of failure. In order to avoid this single point of failure, another processor is provided to, among other things, add another network connection with another network stack. The additional processor has a separate path to at least one other processor and preferably to a plurality of processors. That is, the additional processor has a separate path or interface connecting it to the processor(s) that have the single software stack, which processors are typically in the same networking group or NFS group. This separate path preferably couples the additional processor more tightly to the processor(s). That is, less software overhead and protocol is required for monitoring and control between the additional processor and the processor(s) tightly coupled to it. The tightly coupled path preferably provides capability for monitoring a processor “healthy” signal in hardware and controlling a signal to reset or reboot the processor. The additional processor serves as, and is often called, an alarm card, maintenance card, alarm and maintenance card, chassis management card or watchdog card. This arrangement permits the processors in the system to collectively determine when a processor or communication path has a fault.
Detection methods and recovery algorithms that provide the highest reliability are necessary to exploit this arrangement. The present invention provides such novel detection and recovery algorithms to achieve extremely high availability.
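By way of illustration only, the monitoring and reset role of such an additional processor may be sketched in Python as follows. The class `WatchdogCard`, the callables `read_healthy` and `reset`, and the `miss_limit` threshold are hypothetical names introduced for this sketch; in particular, resetting after a fixed number of consecutive missed “healthy” readings is an assumption and is not a policy stated above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class WatchdogCard:
    """Hypothetical model of an alarm/watchdog card on the tightly coupled path."""
    read_healthy: Callable[[str], bool]  # samples a processor's "healthy" signal
    reset: Callable[[str], None]         # drives a processor's reset/reboot signal
    miss_limit: int = 3                  # assumed threshold, not from the source
    misses: Dict[str, int] = field(default_factory=dict)

    def poll(self, processors: Iterable[str]) -> List[str]:
        """Sample each processor once; reset and return those over the limit."""
        reset_now: List[str] = []
        for cpu in processors:
            if self.read_healthy(cpu):
                self.misses[cpu] = 0  # a healthy reading clears the count
                continue
            self.misses[cpu] = self.misses.get(cpu, 0) + 1
            if self.misses[cpu] >= self.miss_limit:
                self.reset(cpu)
                self.misses[cpu] = 0  # start counting afresh after the reset
                reset_now.append(cpu)
        return reset_now
```

In this sketch the card acts only on locally sampled signals, which reflects the low software overhead of the tightly coupled path; reporting the fault to the other processors over the card's separate network stack is left out.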