FIG. 1 shows an exemplary enterprise telecommunications system. The system 100 includes a primary server 104 providing call control functionality, first, second, . . . nth gatekeepers 108a-n connected to the primary server 104 to provide network administration, and a number of endpoints 112a-n connected to a respective gatekeeper. As used herein, a “gatekeeper” is a computational component that administers traffic flow by performing various functions, such as terminal and gateway registration, address resolution, bandwidth control, admission control, and the like. Every endpoint has an IP address, either a permanent one assigned to a particular network card or a temporary one that is assigned at network login time via a mechanism such as the Dynamic Host Configuration Protocol (DHCP). The server 104 can contain a common database to allow the gatekeepers to share state information. An alternate server 116, such as an Enterprise Survivable Spare processor (ESS) or Local Survivable Processor (LSP), provides redundancy for the endpoints in the event that connectivity is lost with the primary server 104. As will be appreciated, the gatekeeper functionality can co-reside in the server, with the gatekeepers simply providing a front end, or the server can provide a shared database without any gatekeeper functionality.
To become eligible to receive service, an endpoint must discover and register with a gatekeeper (GK). Registration is done over a User Datagram Protocol or UDP-based Registration, Admissions, and Status or RAS channel. As part of registration, the endpoint is authenticated, receives an Alternate Gatekeeper List or AGL with gatekeeper addresses to fail over to if its current gatekeeper fails, and receives a time-to-live parameter within which the endpoint must renew its registration. FIG. 4 shows the registration process as defined by the H.323 protocol. A gatekeeper request is first sent by the endpoint to the primary server/gatekeeper requesting the gatekeeper to service the endpoint. The gatekeeper then responds with a gatekeeper confirm (shown) or reject (not shown) message. When the endpoint receives a gatekeeper confirm message, the endpoint responds with a registration request including, inter alia, the endpoint's IP address, extension, or alias (provided by the user in the endpoint H.323 application). When the registration is successful, the gatekeeper responds with a registration confirm message.
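The registration exchange described above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class and function names (Gatekeeper, RegistrationResult, register) are hypothetical, authentication and the UDP transport are omitted, and the message tags GRQ, GCF, GRJ, RRQ, and RCF are used simply as the customary H.225.0 RAS abbreviations for the gatekeeper request/confirm/reject and registration request/confirm messages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RegistrationResult:
    """What the endpoint learns from registration (hypothetical container)."""
    registered: bool
    alternate_gatekeepers: List[str] = field(default_factory=list)
    time_to_live_s: Optional[int] = None


class Gatekeeper:
    """Toy gatekeeper answering discovery (GRQ) and registration (RRQ)."""

    def __init__(self, alternates: List[str], time_to_live_s: int,
                 accept: bool = True) -> None:
        self.alternates = alternates
        self.time_to_live_s = time_to_live_s
        self.accept = accept
        self.registrations: Dict[str, str] = {}  # alias -> endpoint IP

    def gatekeeper_request(self, endpoint_ip: str) -> str:
        # Reply to a GRQ with a confirm (GCF) or a reject (GRJ).
        return "GCF" if self.accept else "GRJ"

    def registration_request(self, endpoint_ip: str,
                             alias: str) -> RegistrationResult:
        # Reply to an RRQ; a successful result stands in for the RCF,
        # carrying the AGL and the time-to-live.
        self.registrations[alias] = endpoint_ip
        return RegistrationResult(True, list(self.alternates),
                                  self.time_to_live_s)


def register(endpoint_ip: str, alias: str, gk: Gatekeeper) -> RegistrationResult:
    """Run discovery first, then registration, as in the text above."""
    if gk.gatekeeper_request(endpoint_ip) != "GCF":
        return RegistrationResult(False)
    return gk.registration_request(endpoint_ip, alias)
```

In this sketch the endpoint would keep the returned alternate list for failover and would re-register before the time-to-live expires.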
To bring the endpoint into service, a call signaling channel must be established between the endpoint and the gatekeeper/primary server. A Transmission Control Protocol or TCP-based Call Signaling (CS) channel (which is different from the RAS channel), established between an endpoint and its gatekeeper, is commonly used to exchange various call signaling messages including those pertaining to call setup, call termination, capabilities exchange, etc. This channel, initiated by an endpoint or a gatekeeper based on need, may be established at the time of registration or at the time of a call. When established at the time of a call, the channel commonly lasts only for the call's duration. In one configuration, the channel persists after the call is ended. The channel may be established between an endpoint and its gatekeeper in gatekeeper-routed call signaling or between calling endpoints in direct endpoint call signaling. The messages and procedures used on the RAS and CS channels are defined in ITU-T H.225.0. Once registered, endpoints may be considered to be in-service without requiring re-registration or CS channel establishment.
An important aspect of the architecture of FIG. 1 is load balancing the CS channels of the endpoints to distribute the channels uniformly among the gatekeepers. The CS channel connections initiated by gatekeepers are easy to load balance because the gatekeeper has information regarding the current load on each gatekeeper. However, this is not true for endpoint-initiated connections. Such CS channels can be hard to balance. The number of CS channels at a gatekeeper is constantly changing as calls are made and as network and other failures occur. In this dynamic environment, the endpoints do not typically have current information regarding the load on a particular gatekeeper.
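The gatekeeper-initiated case is easy because the initiator can consult a current load table. A minimal sketch of that idea follows, assuming a hypothetical dictionary that maps each gatekeeper name to its present number of CS channels; the function names are illustrative only and are not taken from any product.

```python
from typing import Dict


def least_loaded(load: Dict[str, int]) -> str:
    """Return the gatekeeper that currently carries the fewest CS channels."""
    return min(load, key=load.get)


def assign_channel(load: Dict[str, int]) -> str:
    """Place one new CS channel and update the shared load table."""
    gk = least_loaded(load)
    load[gk] += 1
    return gk
```

Because the table is updated after every assignment, repeated calls keep the channel counts within one of each other. An endpoint that initiates its own connection has no such up-to-date table, which is the difficulty the paragraph above describes.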
Several techniques have been employed to address channel distribution among gatekeepers. First, some products do not even attempt to load balance. This will often lead to an uneven load among gatekeepers, with some getting overloaded while others are only lightly loaded. Second, at the time of registration, either gatekeeper load information is sent explicitly to the endpoints or the gatekeeper addresses are specified in increasing order of load. However, the load information is likely to be stale when the endpoint needs to establish the CS channel. Third, at the time of registration, gatekeeper addresses can be sent in random order to the endpoints. This approach may work if there are a large number of gatekeepers and no failures. It will not work well in a realistic setting where failures periodically occur. Failures of gatekeepers will cause endpoints to migrate to other gatekeepers. When the failed gatekeepers recover, the endpoints will be unevenly distributed. However, the endpoints will still randomly connect to gatekeepers as if the gatekeepers had a uniform distribution of endpoints. Finally, when an endpoint tries to establish a CS channel with a gatekeeper, the gatekeeper redirects the endpoint to connect to the least loaded gatekeeper. This solution may work in certain applications, but it is inefficient.
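The weakness of the random-order technique after a failure can be illustrated with a short simulation. This is a sketch under assumed, purely illustrative numbers (three gatekeepers, 300 initial endpoints, 90 later arrivals); the function name and parameters are hypothetical and the figures are not measurements from any deployment.

```python
import random
from collections import Counter
from typing import List


def simulate(num_gatekeepers: int = 3, initial: int = 300,
             arrivals: int = 90, seed: int = 0) -> List[int]:
    """Return the endpoint count per gatekeeper after a failure and recovery."""
    rng = random.Random(seed)
    gks = list(range(num_gatekeepers))

    # Initial connections: each endpoint picks a gatekeeper uniformly at random.
    home = [rng.choice(gks) for _ in range(initial)]

    # Gatekeeper 0 fails; its endpoints migrate randomly to the survivors.
    survivors = gks[1:]
    home = [rng.choice(survivors) if g == 0 else g for g in home]

    # Gatekeeper 0 recovers with no endpoints, yet new endpoints still choose
    # uniformly, as if every gatekeeper carried the same load.
    home += [rng.choice(gks) for _ in range(arrivals)]

    counts = Counter(home)
    return [counts[g] for g in gks]
```

In this model the recovered gatekeeper only ever receives its uniform share of the new arrivals, so the imbalance created by the failure is never corrected.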
Another important aspect of the architecture of FIG. 1 is the use of a heartbeat mechanism to determine when a gatekeeper fails or becomes unreachable so that an endpoint can receive service from an alternate gatekeeper. It is desirable that this failover to an alternate gatekeeper be performed expeditiously so that continuity of service can be maintained for users. If such a failure occurs when the CS channel is not established, it can take a long time for an endpoint to detect failure. Most likely the failure will be discovered when an attempt is made to originate or deliver a call. Thus, failure recovery must be performed as a call is waiting for a user or as a user is dialing digits. In some cases, a timely recovery may be possible but, frequently, this will lead to dropped calls, calls going to a coverage path, or users unable to make a call. Accordingly, it is important that failures be detected and rectified in a prompt and efficient manner.
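One way an endpoint might use heartbeats to detect an unreachable gatekeeper and move to an alternate is sketched below. The class name, the interval and missed-beat threshold, and the round-robin walk through the gatekeeper list are all assumptions made for illustration; the text above does not prescribe a particular detection rule. Timestamps are passed in explicitly so the logic can be exercised without a real clock or network.

```python
from typing import List


class HeartbeatMonitor:
    """Tracks heartbeat acknowledgements from the current gatekeeper."""

    def __init__(self, gatekeepers: List[str], interval_s: float,
                 max_missed: int) -> None:
        # gatekeepers[0] is the current gatekeeper; the rest stand in for
        # the alternate gatekeeper list received at registration.
        self.gatekeepers = list(gatekeepers)
        self.current = 0
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_ack = 0.0

    def on_ack(self, now: float) -> None:
        """Record that the current gatekeeper answered a heartbeat."""
        self.last_ack = now

    def check(self, now: float) -> str:
        """Return the gatekeeper to use, failing over if beats were missed."""
        if now - self.last_ack > self.interval_s * self.max_missed:
            # Too long without an answer: move to the next alternate.
            self.current = (self.current + 1) % len(self.gatekeepers)
            self.last_ack = now
        return self.gatekeepers[self.current]
```

With a check of this kind running in the background, the failure would be noticed within a bounded time rather than only when a call is originated or delivered.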
Several techniques have been employed to address network failures. First, some products do not implement heartbeat functionality. The CS channel can be recovered as needed (e.g., when a user wants to make a call), regardless of when the failure occurs. However, in some cases the endpoint may not be able to find another gatekeeper in a timely fashion, thereby causing a brief outage. Second, the CS channel may be established immediately at startup and kept up at all times. This approach would work if the CS channel could be established for all the endpoints immediately after registration. However, establishing the CS channel for all the endpoints (especially when their number is large) at startup (or after a major failure) is not scalable since it can cause overload conditions at the gatekeepers. Depending on the number of endpoints, it can take tens of minutes to hours for the CS channel to be established for all endpoints. Thus, endpoints that actually need to use the CS channel (i.e., endpoints making or receiving calls) may be denied service during this time.
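The scaling concern can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each gatekeeper can set up CS channels at a fixed sustained rate; the function name, the rate, and the endpoint and gatekeeper counts are hypothetical and do not come from any measured system.

```python
def setup_time_s(num_endpoints: int, num_gatekeepers: int,
                 channels_per_s_per_gk: float) -> float:
    """Seconds needed to establish a CS channel for every endpoint,
    assuming the work is spread evenly and each gatekeeper sustains
    the given setup rate."""
    return num_endpoints / (num_gatekeepers * channels_per_s_per_gk)


# Assumed example: 36,000 endpoints, 4 gatekeepers, 5 channel setups
# per second per gatekeeper.
minutes = setup_time_s(36_000, 4, 5.0) / 60
```

Under these assumed figures the last endpoint would wait 30 minutes for its channel, which is consistent in order of magnitude with the range stated above.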