To achieve high availability, many services and applications operate within a clustering environment, in which multiple independent devices operate in parallel. If one device fails, a secondary device can take over the responsibilities of the failed device. FIG. 1 is a diagram illustrating an example of a cluster. As can be seen, there are two cluster members (devices 100, 102), and two networks connected to the cluster (networks 104 and 106). In practice, there can be any number of cluster members and networks.
The clustering environment can be configured in many different ways, including active/standby, load-sharing, or load-balancing topologies, depending upon the network or application requirements. Devices within the cluster are connected using some internal mechanism, and communication between cluster members may be available. Even if it is not, however, the cluster entity is capable of recognizing a failure of a member and taking appropriate action.
Logically, the cluster is represented as a single entity to the outside world, which in the case of networking includes the attached networks. Neighboring devices “see” only a single entity (the cluster), with which they communicate. This permits the neighbors to be unaware of a failure, because characteristics such as IP address belong to the cluster, not the individual (failed) devices that make up the cluster.
Many types of applications and services that use clustering also require routing in their networks. As a result, it is necessary to add routing capability for the cluster itself, so that the applications or services have the necessary information (e.g., routes) to operate properly within the network.
Dynamic routing occurs when routing components talk to adjacent routing components, informing each other of the networks to which each routing component is currently connected. The routing components must communicate using a routing protocol that is run by an application instantiating the routing function, or a routing daemon. In contrast to static routing, the information placed into the routing tables is added and deleted dynamically by the routing daemon as the routes in the system change over time. Additionally, other changes can occur to the routing information over time. For example, route preferences can change due to changes in network conditions such as delays, route additions/deletions, and network reachability issues.
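By way of illustration only, the dynamic behavior described above can be sketched as a small in-memory routing table whose entries are learned, withdrawn, and re-ranked as conditions change. The class name `RoutingTable`, its methods, and the addresses and metrics used are hypothetical and are not taken from any particular routing daemon.

```python
class RoutingTable:
    """Hypothetical routing table maintained by a routing daemon."""

    def __init__(self):
        # prefix -> {next_hop: metric}; a lower metric is preferred
        self._routes = {}

    def learn(self, prefix, next_hop, metric):
        """Add a route, or update the metric of a route already known."""
        self._routes.setdefault(prefix, {})[next_hop] = metric

    def withdraw(self, prefix, next_hop):
        """Delete a route; drop the prefix when no next hop remains."""
        hops = self._routes.get(prefix, {})
        hops.pop(next_hop, None)
        if not hops:
            self._routes.pop(prefix, None)

    def best(self, prefix):
        """Return the preferred next hop for a prefix, or None if unreachable."""
        hops = self._routes.get(prefix)
        if not hops:
            return None
        return min(hops, key=hops.get)


table = RoutingTable()
table.learn("10.0.0.0/24", "r1", 10)
table.learn("10.0.0.0/24", "r2", 20)
first_choice = table.best("10.0.0.0/24")     # "r1": lowest metric

table.learn("10.0.0.0/24", "r1", 30)         # delay on r1 raises its metric
after_delay = table.best("10.0.0.0/24")      # "r2": preference changes

table.withdraw("10.0.0.0/24", "r2")          # r2 becomes unreachable
after_withdraw = table.best("10.0.0.0/24")   # "r1": the only route left
```

With static routing, by contrast, the contents of such a table would change only when an administrator edited them.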
Open Shortest Path First (OSPF) is a link-state protocol that implements dynamic routing on routing components. In a link-state protocol, each routing component actively tests the status of its link to each of its neighbors, and sends this information to its other neighbors. This process is repeated by the routing components of all nodes in the network.
Each routing component takes this link-state information and builds a complete routing table. This method can be used to quickly implement a dynamic routing system, especially in the case of changes in the links in the network.
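By way of illustration only, the step of building a routing table from collected link-state information can be sketched with Dijkstra's shortest-path algorithm, which is the computation OSPF uses. The function name `build_routing_table`, the node names, and the link costs are hypothetical, and the sketch omits areas, flooding, and the other machinery of a real OSPF implementation.

```python
import heapq


def build_routing_table(link_state_db, source):
    """Return {destination: (cost, next_hop)} using Dijkstra's algorithm.

    link_state_db maps each node to {neighbor: link_cost}.
    """
    best = {source: (0, None)}
    # Heap entries: (cost so far, node, first hop taken from the source)
    heap = [(0, source, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if cost > best[node][0]:
            continue  # stale entry; a cheaper path was already found
        for neighbor, link_cost in link_state_db.get(node, {}).items():
            new_cost = cost + link_cost
            hop = neighbor if node == source else first_hop
            if neighbor not in best or new_cost < best[neighbor][0]:
                best[neighbor] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, neighbor, hop))
    del best[source]
    return best


# Hypothetical four-node topology as seen in node A's link-state database.
LSDB = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}
routes = build_routing_table(LSDB, "A")
# routes == {"B": (1, "B"), "C": (2, "B"), "D": (3, "B")}

# The B-C link fails; each node simply recomputes from the updated database.
LSDB_AFTER_FAILURE = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "D": 4},
    "C": {"A": 5, "D": 1},
    "D": {"B": 4, "C": 1},
}
routes_after = build_routing_table(LSDB_AFTER_FAILURE, "A")
# routes_after == {"B": (1, "B"), "C": (5, "C"), "D": (5, "B")}
```

Because every routing component holds the same link-state information, a change in one link requires only a local recomputation of this kind on each node.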
The clustering environment imposes some limitations on the routing and/or signaling protocols used by the cluster to communicate with the neighboring devices. First, the protocols must communicate with the attached networks using the cluster addressing scheme. Private addresses assigned to the individual devices that make up the cluster must not be shared outside the cluster. Second, since neighboring devices know of only a single entity (the cluster), only one member within the cluster may be performing route exchange with neighbors at any given time (using the cluster address). If multiple devices attempt to communicate externally using the same addresses, network problems will result.
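By way of illustration only, the two limitations above can be sketched as follows: every outgoing routing update carries the cluster address rather than a member's private address, and only one member at a time is permitted to exchange routes. The classes `Member` and `Cluster`, the rule that the first healthy member is the speaker, and all addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    private_ip: str       # must never appear outside the cluster
    healthy: bool = True


@dataclass
class Cluster:
    cluster_ip: str       # the only address neighbors ever see
    members: list = field(default_factory=list)

    def speaker(self):
        """Return the single member allowed to exchange routes, if any."""
        for member in self.members:
            if member.healthy:
                return member
        return None

    def send_routing_update(self, member, routes):
        """Build an update on behalf of the cluster, or refuse."""
        if member is not self.speaker():
            raise PermissionError(
                f"{member.name} is not the active routing speaker")
        # The source is always the cluster address, never member.private_ip.
        return {"source": self.cluster_ip, "routes": list(routes)}


m1 = Member("device-100", "192.168.1.1")
m2 = Member("device-102", "192.168.1.2")
cluster = Cluster("203.0.113.10", [m1, m2])

update = cluster.send_routing_update(m1, ["10.0.0.0/24"])
# update["source"] is the cluster address

m1.healthy = False        # the active member fails
failover_update = cluster.send_routing_update(m2, ["10.0.0.0/24"])
# the neighbors still see the same source address
```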
One solution that has been proposed is for the clustering environment to use protocol synchronization to synchronize the data structures and all internal data from each routing protocol on the active device to the backup device(s). The idea is that during a failure, the backup routing protocol can come online and begin communication with the neighboring devices as if nothing has occurred. The only real advantage to this solution is that traditionally, legacy high availability (HA) is achieved by mirroring the primary to the backup device in every way. Therefore, users who are familiar with traditional HA and not familiar with routing may feel comfortable with this solution. The disadvantage, however, is that it is a very complex, problematic, and unpredictable solution that has a high impact on the cluster members and the internal cluster network. Since routing/signaling protocols were not designed to run in this way, the feasibility of this design is suspect. More importantly, however, in this solution the neighboring routing devices detect the failure of the active routing device, and subsequently rebuild their routing tables with the new information, which is hardly a seamless transition. In large networks, the number of neighboring devices and the sizes of their routing tables are quite high, which places a significant burden on the network during a failover scenario.
Another solution that has been proposed is to introduce a high-end router to the cluster that can support equal-cost load balancing. The new cluster router (CR) is responsible for performing all routing communications with external network devices on behalf of the cluster addresses. Each cluster member runs standard OSPF to facilitate route exchange with the CR. The CR performs equal-cost load balancing across all of the cluster members. The cost and complexity of this solution, however, are both quite high. Additionally, the CR represents a single point of failure that places network functioning at risk.
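By way of illustration only, equal-cost load balancing of the kind attributed to the CR can be sketched as choosing, for each traffic flow, one of the cluster members that share the lowest route cost. Hashing the flow identifier is one common way to keep a flow on a single member; it is an assumption here, as are the function name `pick_member`, the member names, and the costs.

```python
import hashlib


def pick_member(flow, next_hops):
    """Choose a member for a flow among the lowest-cost next hops.

    flow is any tuple identifying the traffic, e.g. (src, dst, port).
    next_hops maps each cluster member to the cost of its route.
    """
    lowest = min(next_hops.values())
    candidates = sorted(m for m, cost in next_hops.items() if cost == lowest)
    # A stable hash keeps every packet of a flow on the same member.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(candidates)
    return candidates[index]


# Two members advertise equal cost; a third, costlier one is never chosen.
NEXT_HOPS = {"device-100": 10, "device-102": 10, "device-backup": 50}
flow = ("198.51.100.7", "203.0.113.10", 443)
chosen = pick_member(flow, NEXT_HOPS)
```

In this sketch every decision passes through the one function standing in for the CR, which mirrors the single point of failure noted above.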
What is needed is a solution that provides routing capabilities in a clustering environment in an efficient and effective manner.