In a communication network, “failover” is the capability to switch over automatically to a redundant or standby network component or communication link or pathway (e.g., computer server, router, controller, optical or other transmission line, or the like) upon the failure or abnormal termination of the previously active component or link. In operation, one or more designated network components are electronically monitored on an ongoing basis. Upon failure of one of the designated components, the system automatically carries out a failover operation for switching to a redundant or standby system, without the need for human intervention. Failover capability is usually provided in servers, systems, or networks where continuous availability and a high degree of reliability are required, such as in wireless communication networks.
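The monitor-then-switch behavior described above can be illustrated with a minimal sketch. The names `Component`, `FailoverGroup`, and `max_missed` are hypothetical and chosen only for illustration; a real system would probe components over the network rather than read a boolean flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    """Hypothetical monitored network component (server, router, link)."""
    name: str
    healthy: bool = True


class FailoverGroup:
    """One active component backed by one standby (illustrative only)."""

    def __init__(self, primary: Component, standby: Component, max_missed: int = 3):
        self.active: Component = primary
        self.standby: Optional[Component] = standby
        self.max_missed = max_missed  # assumed threshold of failed health checks
        self.missed = 0

    def poll(self) -> str:
        """Run one monitoring cycle and return the name of the active component."""
        if self.active.healthy:
            self.missed = 0
        else:
            self.missed += 1
            # Switch automatically, without human intervention, once the
            # active component has failed enough consecutive checks.
            if self.missed >= self.max_missed and self.standby is not None:
                self.active, self.standby = self.standby, None
                self.missed = 0
        return self.active.name
```

Requiring several consecutive missed checks before switching is an assumption of this sketch; it also shows why detection adds a time lag before the standby takes over.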
Although failover operations preserve overall network integrity by maintaining communications across the network, in certain systems they may result in network congestion, a decrease in data throughput, dropped calls, and the like. A failure in a multilink group (a set or grouping of data links or nodes) typically causes congestion in forwarding data packets over that hop, including delay of time-sensitive air interface data. Such congestion can last much longer than the network can tolerate. In an IP (Internet protocol)-based wireless network, e.g., a “1x-EVDO” (Evolution Data Optimized, or Evolution Data Only) network, path redundancy is normally built using IP routes and standby/backup devices. Upon failover to such redundant components, the IP routes are typically divided into two fractions, each of which experiences congestion.
An example of the effect of failover on network congestion and performance is shown in FIG. 1 for an IP-based wireless network utilizing automatic protection switching (APS). APS is a failover protection mechanism for optical networks, e.g., the optical backhaul portion of a wireless network connecting one or more base stations to a radio network controller or the like. The network operates in a congestion monitoring and recovery mode, in a standard, ongoing manner, to maintain a designated minimum quality level for a maximum number of calls. (The quality level may be assessed in terms of a target frame error rate, or in terms of other designated parameters.) In this mode, traffic conditions are monitored, and communications are controlled (e.g., call admissions and drops, soft handoff, and the like) based on the traffic load and/or instantaneous available bandwidth. Thus, network load may be dropped if required to maintain acceptable levels of voice quality, according to a designated transmission hierarchy. For example, it may be the case that data-only transmissions and the handoff legs of voice calls are dropped before other, higher-priority transmissions.
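As a rough illustration of dropping load according to a transmission hierarchy, the following sketch sheds the lowest-priority flows first until the remaining load fits the available bandwidth. The `Flow` type, the `DROP_ORDER` table, and the bandwidth figures are assumptions made for this example; the text above does not specify the actual hierarchy or admission algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Flow:
    name: str
    kind: str   # one of the keys in DROP_ORDER
    kbps: int   # bandwidth the flow currently consumes


# Hypothetical hierarchy: a lower number is dropped earlier.
DROP_ORDER = {"data_only": 0, "voice_handoff_leg": 1, "voice_primary": 2}


def shed_load(flows: List[Flow], available_kbps: int) -> Tuple[List[Flow], List[Flow]]:
    """Return (kept, dropped) so that the kept flows fit in available_kbps."""
    kept = sorted(flows, key=lambda f: DROP_ORDER[f.kind])
    dropped: List[Flow] = []
    total = sum(f.kbps for f in kept)
    while kept and total > available_kbps:
        victim = kept.pop(0)  # lowest-priority flow still carried
        dropped.append(victim)
        total -= victim.kbps
    return kept, dropped
```

With one data-only flow, one handoff leg, and one primary voice leg competing for too little bandwidth, this sketch drops the data-only flow first and the handoff leg second, matching the example ordering in the paragraph above.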
The available bandwidth 10 for a multilink group in the network is shown in FIG. 1 as a function of time, as is the aggregate traffic 12 and the signaling traffic volume 14, which is normally carried over TCP (transmission control protocol) or a similar protocol. At time T1, an APS failure occurs, such as the failure of an optical backhaul circuit or the like. Subsequent to time T1, the instantaneously available bandwidth 10 decreases as the system automatically commences switching from the failed optical circuit to a redundant circuit. Due to the time lag in detecting and switching from the failed component to the redundant component, there is a temporary decrease or interruption in the network resources for the multilink group, resulting in a concomitant reduction in effective bandwidth. At time T2, the available bandwidth 10 falls below the traffic volume 12, resulting in dropped traffic 16 as the network compensates for the reduced bandwidth according to its congestion monitoring and recovery functionality. Although the available bandwidth is reduced, it is typically the case that a certain minimum amount of bandwidth 18 is retained, e.g., the bandwidth associated with a single T1/DS1 line or circuit. (A T1/DS1 circuit is made up of twenty-four 8-bit channels or timeslots, each channel being a 64 kbit/s multiplexed pseudo-circuit.) At time T3, for a stateless failover, the multilink group is renegotiated for adding a DS1 circuit through the redundant system component(s), resulting in a stable bandwidth being achieved at T4 and dropped transmissions being set up again automatically where possible. Generally speaking, a “stateless” failover is one where (i) the standby component assumes the communication addresses of the failed entity, e.g., IP and MAC addresses, but (ii) any open connections are closed, requiring renegotiation and reconnection with the standby component.
In stateful failovers, on the other hand, the primary and standby components exchange state information over a private link so that applications with open connections during a failover do not have to reconnect to the communication session.
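The difference between the two failover types can be sketched as follows. The `Node` class and `fail_over` function are hypothetical. For brevity the sketch copies session state at the moment of failover, whereas a stateful system as described above would have synchronized that state over the private link beforehand.

```python
from typing import Dict, List, Set


class Node:
    """Hypothetical primary or standby component."""

    def __init__(self, name: str):
        self.name = name
        self.addresses: Set[str] = set()    # e.g., IP and MAC addresses
        self.sessions: Dict[str, str] = {}  # session id -> connection state


def fail_over(primary: Node, standby: Node, stateful: bool) -> List[str]:
    """Move service to the standby; return session ids that must reconnect."""
    # In both cases the standby assumes the failed entity's addresses.
    standby.addresses = set(primary.addresses)
    if stateful:
        # Stands in for state previously exchanged over a private link.
        standby.sessions = dict(primary.sessions)
        must_reconnect: List[str] = []
    else:
        # Stateless: open connections are closed and must be renegotiated.
        must_reconnect = sorted(primary.sessions)
    primary.addresses.clear()
    primary.sessions.clear()
    return must_reconnect
```

In the stateless case the returned list names every open session, which is the renegotiation work that takes place between T3 and T4 in FIG. 1.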
Because failover operations result in decreased bandwidth and a perceived congestion condition, and because the network normally handles congestion by dropping the load at the source, failover operations may result in dropped calls. This is the case even though the congestion is temporary and not a result of actual aggregate traffic load in the network. In particular, the recovery process starts at time T2, when congestion begins. In some systems, the congestion recovery process requires a handshake over the route, such as in wireless networks where internal resources are released on both ends of the network. Such handshakes may not arrive in time because of the congestion, meaning that recovery may extend beyond the time of failover, e.g., past time T4. Delayed recoveries can lead to unnecessary traffic and/or call drops, which impact service conditions for both the end user and service provider.
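To make the relationship between available bandwidth, retained minimum bandwidth, and dropped traffic concrete, the sketch below computes the shortfall at each sample of a timeline like the one in FIG. 1. The sample values are invented for illustration and do not come from the figure. The only fixed quantity is the DS1 payload rate, which follows from twenty-four channels at 64 kbit/s each.

```python
from typing import List

DS1_CHANNELS = 24
CHANNEL_KBPS = 64
# Payload rate of one DS1/T1 circuit: 24 * 64 = 1536 kbit/s.
DS1_PAYLOAD_KBPS = DS1_CHANNELS * CHANNEL_KBPS


def dropped_traffic(bandwidth_kbps: List[int], traffic_kbps: List[int]) -> List[int]:
    """Per-sample traffic (kbit/s) exceeding the available bandwidth.

    Assumes, as in the example above, that at least one DS1 circuit's
    worth of bandwidth is retained throughout the failover.
    """
    dropped = []
    for available, offered in zip(bandwidth_kbps, traffic_kbps):
        effective = max(available, DS1_PAYLOAD_KBPS)  # retained minimum
        dropped.append(max(0, offered - effective))
    return dropped


# Hypothetical timeline: bandwidth falls after the failure, bottoms out at
# one DS1, then starts to recover as circuits are re-added.
bandwidth = [6144, 4608, 3072, 1536, 1536, 3072]
traffic = [4000] * 6
print(dropped_traffic(bandwidth, traffic))  # [0, 0, 928, 2464, 2464, 928]
```

The nonzero entries correspond to the interval after T2 in which load is shed even though the offered traffic itself has not increased.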