The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Border Gateway Protocol (BGP) is a network protocol used in packet-switched networks for exchanging routing information between gateway hosts (each with its own router) in a network of autonomous systems. Routers employing BGP interact with peers by establishing Transmission Control Protocol (TCP) connections. A router may be peered with another router in another domain using External Border Gateway Protocol (EBGP) or with another router within a domain using Internal Border Gateway Protocol (IBGP). In either case, current implementations of BGP often enable the TCP property called RETRANSMIT_FOREVER, which is used to block TCP from tearing down the session even if there is data in the TCP retransmit queue and retransmissions are failing.
One problem with use of RETRANSMIT_FOREVER is that when the retransmission queue becomes empty, such “idle” sessions are not torn down. These idle sessions continue to exist, using up resources to track and maintain them.
One approach to addressing this issue is to provide an application-level “keepalive” mechanism to detect session-related problems that require the session to be terminated. This mechanism terminates a session when a specified number of successive KEEPALIVE messages are lost. In other words, if no KEEPALIVE message is received for the duration of a specific period of time, called the hold time, the session is terminated. The keepalive time and the hold time are both configurable. The defaults are 60 seconds for the keepalive time and 180 seconds for the hold time.
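The hold-time rule described above can be illustrated with a minimal sketch. The class and method names (HoldTimer, on_keepalive, expired, missed_keepalives) are hypothetical and are not taken from any BGP implementation. Timestamps are plain seconds passed in by the caller, and only the default values of 60 and 180 seconds come from the text.

```python
from dataclasses import dataclass


@dataclass
class HoldTimer:
    """Illustrative hold-time tracker for one session (hypothetical names)."""

    keepalive_time: float = 60.0   # default interval between KEEPALIVE messages
    hold_time: float = 180.0       # default silence allowed before termination
    last_received: float = 0.0     # time the last KEEPALIVE arrived, in seconds

    def on_keepalive(self, now: float) -> None:
        # Each received KEEPALIVE restarts the hold period.
        self.last_received = now

    def missed_keepalives(self, now: float) -> int:
        # Number of whole keepalive intervals that have elapsed in silence.
        return int((now - self.last_received) // self.keepalive_time)

    def expired(self, now: float) -> bool:
        # The session is terminated once the hold time passes with no KEEPALIVE.
        return now - self.last_received >= self.hold_time


timer = HoldTimer()
timer.on_keepalive(now=10.0)
print(timer.expired(now=100.0))            # still inside the hold time
print(timer.expired(now=190.0))            # 180 seconds of silence have elapsed
print(timer.missed_keepalives(now=190.0))  # successive intervals with no message
```

With the defaults, the hold time equals three keepalive intervals, so the two ways of stating the rule (a number of successive lost messages, or a period of silence) coincide in this sketch.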
Unfortunately, this approach has disadvantages. In order to quickly detect peer BGP application failures, many network administrators set the hold time and the keepalive time to values on the order of a few seconds. In today's high-speed networks, however, both the defaults and retuned values on the order of seconds are very long times. Thus, even when these values are retuned to the order of seconds, the idle sessions continue to place a large burden on BGP implementations in terms of processing power and the number of BGP sessions that a router can support.
Based on the foregoing, there is a clear need for a mechanism that will enable detection of session failures with improved speed relative to conventional techniques. There is also a need for a failure detection mechanism that will not adversely affect BGP scalability.
For example, if a failure occurs in a first BGP process, in a TCP process, or in the network element that is hosting the BGP and TCP processes, a second BGP process (or BGP “peer”) is required to re-calculate route information and potentially notify other peers so that all peers converge on the same routing information. In conventional practice, the second BGP process becomes aware of the failure only after not receiving a KEEPALIVE message from the first BGP process within a specified time period. Typically, a BGP peer can identify a failure no sooner than 60 seconds after the failure occurs.
While determining failure in 60 seconds was acceptable in early network deployments, modern networks require far faster detection and recovery when connections, processes or nodes are unavailable. The timeout interval could be shortened substantially, e.g., to one second. However, this approach would not scale in networks that have thousands of peers because the network becomes clogged with too many messages.
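The scaling concern can be made concrete with a small worked calculation. The function name and the assumption that three KEEPALIVE messages are sent per hold period (mirroring the 60-second and 180-second defaults) are illustrative choices, not figures stated by any specification cited here. The result counts only messages sent by one router, so the received load would be of similar size.

```python
def keepalive_send_rate(peers: int, hold_time_s: float,
                        keepalives_per_hold: int = 3) -> float:
    """Approximate KEEPALIVE messages one router sends per second.

    Assumes (for illustration only) that each session sends
    `keepalives_per_hold` messages during every hold period.
    """
    if peers < 0 or hold_time_s <= 0 or keepalives_per_hold <= 0:
        raise ValueError("peers must be >= 0; times and counts must be > 0")
    return peers * keepalives_per_hold / hold_time_s


# Compare the default hold time with a hold time shortened to one second.
for hold in (180.0, 1.0):
    rate = keepalive_send_rate(peers=1000, hold_time_s=hold)
    print(f"hold time {hold:>5.0f} s -> {rate:.1f} messages/s")
```

Under these assumptions, the message rate grows in inverse proportion to the hold time, which is why simply shortening the timers does not scale to thousands of peers.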
In large networks that consist of thousands of network elements hosting BGP, a 60-second delay is unacceptable. In combination with the time required for convergence following a failure, the time delay introduced using a conventional timeout approach is not fast enough. Thus, there is a need for a better way to detect when a protocol failure has occurred in a network element.
The use, in protocols such as TCP, of sequence numbers to reliably track and deliver data segments, creates a related problem. Specifically, in a redundant network element that has an active processor and a standby or backup processor, an approach is needed for providing an accurate sequence number to the standby processor so that the standby processor can take over the connection for the active processor.
One approach to this problem is disclosed in prior application Ser. No. 10/888,122, filed Jul. 9, 2004, “Rapid Protocol Failure Detection,” of Chandrashekhar Appanna et al., assigned to the same assignee as the present application (“Appanna et al.”). The disclosure of Appanna et al. addresses a scenario in which a TCP SYN segment carries a sequence number that does not fall within the allowed window. A restarting peer learns the sequence number that will be acceptable to the remote peer by soliciting a TCP ACK segment for the earlier SYN, which carries an acknowledgment value, and then generating a RST segment that carries the acknowledgment value as its sequence number. Hence a total of three segments are required, which delays notification about a protocol failure. The amount of delay is directly proportional to the round-trip time of the link on which the traffic is sent, and the exchange also generates extra traffic.
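The three-segment exchange described above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not the method of Appanna et al. and not a real TCP stack: the names Segment, EstablishedPeer, and reset_stale_session are hypothetical, the window check ignores everything except the sequence number, and the case of a SYN that happens to fall inside the window is left unmodeled.

```python
from dataclasses import dataclass

SEQ_SPACE = 2 ** 32  # TCP sequence numbers wrap modulo 2**32


@dataclass
class Segment:
    flags: str      # "SYN", "ACK", or "RST" in this toy model
    seq: int = 0
    ack: int = 0


class EstablishedPeer:
    """Peer that still holds the stale session (hypothetical model)."""

    def __init__(self, rcv_nxt: int, rcv_wnd: int) -> None:
        self.rcv_nxt = rcv_nxt      # next sequence number expected
        self.rcv_wnd = rcv_wnd      # size of the receive window
        self.established = True

    def in_window(self, seq: int) -> bool:
        return (seq - self.rcv_nxt) % SEQ_SPACE < self.rcv_wnd

    def receive(self, seg: Segment):
        if seg.flags == "RST":
            if self.in_window(seg.seq):
                self.established = False   # acceptable RST tears down the session
            return None
        if seg.flags == "SYN" and not self.in_window(seg.seq):
            # Out-of-window SYN: answer with an ACK naming the expected sequence.
            return Segment("ACK", ack=self.rcv_nxt)
        return None  # other cases are not modeled in this sketch


def reset_stale_session(peer: EstablishedPeer, initial_seq: int) -> list:
    """Return the flags of every segment exchanged, in order."""
    trace = ["SYN"]
    reply = peer.receive(Segment("SYN", seq=initial_seq))
    if reply is None:
        return trace
    trace.append(reply.flags)
    # Reuse the acknowledgment value as the sequence number of the RST.
    peer.receive(Segment("RST", seq=reply.ack))
    trace.append("RST")
    return trace


peer = EstablishedPeer(rcv_nxt=5000, rcv_wnd=1000)
print(reset_stale_session(peer, initial_seq=90000), peer.established)
```

In this model the stale session is cleared only after the SYN, the solicited ACK, and the RST have all crossed the link, which is the source of the round-trip-dependent delay noted above.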