The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
For network management purposes, a network element cluster may be defined as a first network switch, a plurality of network elements, some of which are actively processing or routing data and others of which are held in a backup pool for use in the event of a failure, and a second network switch. If one of the active network elements experiences a transient or permanent failure, proper operation of the network requires taking the failed network element offline and substituting one of the backup network elements. Because the first and second network switches have numerous logical connections established with the failed network element, such substitution also requires re-configuration of both the first and second network switches so that the connections reference the substituted backup network element.
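By way of illustration only, the cluster definition and the substitution step described above can be sketched as a small data model. This is a minimal sketch under stated assumptions, not an implementation of any actual device interface: the names `Switch`, `Cluster`, `retarget`, and `substitute` are hypothetical, and each logical connection is assumed to be a simple mapping from a connection identifier to the name of the network element it references.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical switch model: each logical connection references one element."""
    name: str
    # connection identifier -> name of the network element it references
    connections: dict[str, str] = field(default_factory=dict)

    def retarget(self, failed: str, substitute: str) -> int:
        """Make every connection that references `failed` reference `substitute`."""
        moved = 0
        for conn_id, element in list(self.connections.items()):
            if element == failed:
                self.connections[conn_id] = substitute
                moved += 1
        return moved


@dataclass
class Cluster:
    """Hypothetical cluster: two switches, active elements, and a backup pool."""
    first_switch: Switch
    second_switch: Switch
    active: set[str]
    backup_pool: list[str]

    def substitute(self, failed: str) -> str:
        """Take `failed` offline, promote a backup, and re-configure both switches."""
        if failed not in self.active:
            raise ValueError(f"{failed} is not an active network element")
        if not self.backup_pool:
            raise RuntimeError("no backup network element is available")
        replacement = self.backup_pool.pop(0)
        self.active.remove(failed)
        self.active.add(replacement)
        # Both switches hold connections to the failed element, so both change.
        for switch in (self.first_switch, self.second_switch):
            switch.retarget(failed, replacement)
        return replacement
```

The sketch shows why substitution alone is insufficient: the failed element is referenced from both switches, so both connection tables must be updated in the same operation.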
At present, the process of substitution and re-configuration is manual and requires an unacceptably long amount of time, ranging from hours to days. There is no automated method for performing the substitution and re-configuration, which are collectively termed “re-provisioning” herein. Industry service providers have a great need for improved methods and systems that can solve this problem.
In one specific industry context, the Cisco 7400 ASR from Cisco Systems, Inc. offers enterprise and service provider customers a cost-effective, single-platform solution with the high performance, density, availability, and scalability to be deployed across the network from enterprise to POP environments. By leveraging the multifunction capabilities of the Cisco 7400 ASR, a customer can simplify its network architecture, significantly reducing costs and increasing revenue opportunities through value-added services.
In particular, as the business of a service provider grows, a group of 7400 devices may be clustered into a single logical resource for administrative and management simplicity. The cluster typically is logically interposed between two switch devices that are associated with different networks, such as a metro ATM switch and an Ethernet switch. Commercial examples of ATM switches include the LS1010, Catalyst 8510, and Catalyst 8540 from Cisco Systems. Commercial examples of Ethernet switches include the 2948G and 3500XL from Cisco Systems. The ability to provision and manage a cluster of devices is critical to the success of a customer and hence the success of the 7400 platform.
However, at this time, there is no single solution to manage this particular cluster of devices. Service provider customers, in particular, desire to have service and subscriber-provisioning tools that provide a full solution, including re-provisioning of clusters in response to failure of a network element in a cluster.
One of the chief concerns during cluster management is the case where one of the devices in the cluster fails. Requiring human intervention to move all the connections from the failed node to a back-up node is costly. Customers need a higher-availability solution that automates fail-over when a node in a cluster fails and that has a minimal impact on service. Hence, when a node fails, all connections on that node must be switched to an alternate node with minimal or no effect on service.
Various failover techniques are known for use with replicated servers in a server farm and in redundant processor scenarios. For example, Cisco Systems has technology known as stateful switchover (SSO) and non-stop forwarding (NSF); however, both are intra-device solutions that can be applied only at the switch level and cannot provide a solution for a cluster or stack of network elements. Currently, no approach provides for automatically implementing changes on both an ATM switch and a router with one tool.
Other solutions include redundant processor cards, but none of them can deal with redundancy across different platforms. For example, the IBM HACMP system and the Tandem NonStop system require total “shadowing” of software, data, and hardware resources in the system. The NonStop computing architecture is based on hot-standby replication of cluster nodes combined with transaction processing enforced at the OS level. Rings of cluster nodes regularly ping each other. If a node suspects that another node has failed, it becomes the ‘leader’ and broadcasts a regrouping message. This is a very expensive and complicated approach that does not address the specific problems outlined above. Thus, prior techniques have not been applied to clustered network devices in a way that addresses the foregoing problems.