The present invention relates in general to managing mixed-vendor, meshed communication networks, and, more specifically, to predicting re-route interactions in association with maintenance actions on network elements.
With the increasing complexity of communication networks, such as layered protocol networks (IP over ATM, or IP over Frame Relay) used with the Internet and synchronous optical networks (SONET) used in telephone networks, network management functions such as network element provisioning and configuration, load monitoring, and fault detection become increasingly important. Currently, network element management systems collect network element resource status via various management protocols including TL1, CMIP, CORBA, SNMP, and others. Network element resources may, for example, include routers, switches, servers, bridges, multiplexers, and other devices depending upon the type of network. The availability of element resource status allows external (i.e., management) systems to determine the load and utilization levels of the device, the fault state (or impending fault states), and current configurations. This collected information allows network service providers to perform the standard FCAPS (fault, configuration, accounting, performance, and security) functions associated with managing a network.
Recently, it has become more and more popular to relegate the control portion of the network element configuration function of FCAPS to a logically separate “control plane” system. In some cases, the control functionality is housed within the network element itself, and, in other cases, it is housed in close proximity to the network element. The control functionality is concerned principally with fulfilling explicit or implicit end-user requests (e.g., requests whose response time is clearly discernible by the end-user). These functions typically involve providing a transient connection or allocating processing or storage resources to a specific user request. Fault detection, correction, and restoration after failure of these resources are also typically handled by the control plane system.
Traffic load levels within a network impact the performance of all network elements. To maintain a reasonable system cost, networks are typically over-subscribed for their potential peak traffic rates. In other words, the available resources of the network could not support all possible requests that could potentially occur at the same time (e.g., the telephone network cannot support a situation where every telephone is in use simultaneously).
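The over-subscription described above can be illustrated with simple arithmetic. The following is a minimal sketch; the function names and the subscriber and capacity figures are hypothetical and chosen only for illustration.

```python
def oversubscription_ratio(potential_peak_demand, provisioned_capacity):
    """Ratio of the worst-case simultaneous demand to what the network can carry."""
    if provisioned_capacity <= 0:
        raise ValueError("provisioned_capacity must be positive")
    return potential_peak_demand / provisioned_capacity


def blocked_requests(simultaneous_requests, provisioned_capacity):
    """Number of requests that cannot be served at a given instant."""
    return max(0, simultaneous_requests - provisioned_capacity)


# Hypothetical figures: 10,000 telephones share capacity for 1,000 simultaneous calls.
ratio = oversubscription_ratio(10_000, 1_000)
blocked = blocked_requests(1_200, 1_000)
```

A ratio greater than 1 means the network is over-subscribed: it relies on the assumption that not all users are active at once.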
In a meshed network, each network element is connected to many other network elements so that network traffic can potentially reach its destination by many different paths. Due to the large size and complexity of most networks, network elements from a variety of vendors/manufacturers are typically present. Unfortunately, when pieces of the control functionality are shared across multiple vendors and/or element types, the restoration steps taken for a resource become unpredictable. Since the various vendors and/or network element types do not have an agreed-upon standard method of restoration among themselves, restoration actions must be coordinated above the network element level to be rational and predictable. Restoration that is coordinated external to the network element level is frequently too slow to fall within the service level agreement (SLA) allowances.
When a failure of a network element or other error occurs making a communication path in a meshed network unavailable, the traffic that was being handled by a number of transport paths, x, must then be handled by x−1 paths. In an IP network, for example, the error correction action (i.e., re-convergence) automatically re-routes traffic paths over the remaining links after some amount of convergence time. This process, however, does not take SLA parameters into consideration when determining how paths are re-routed. Consequently, for a premium network service such as video conferencing, SLA requirements for a limited transport latency and/or jitter may be violated by the newly converged configuration and/or by the re-convergence action itself (e.g., re-convergence takes 10 minutes against an SLA budget that allows only 5 minutes of outage annually).
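The effect of moving from x paths to x−1 paths, and of a re-convergence time that exceeds the outage budget, can be sketched as follows. This is an illustrative sketch only: it assumes the traffic is spread evenly over the surviving paths, which real re-convergence does not guarantee, and all function names and load figures are hypothetical.

```python
def per_path_load_after_failure(total_load, paths, capacity_per_path):
    """Return (load per path before, load per path after, overloaded?) when one path fails.

    Assumes, for illustration, that traffic is spread evenly over the paths.
    """
    if paths < 2:
        raise ValueError("at least two paths are needed to survive a single failure")
    before = total_load / paths
    after = total_load / (paths - 1)
    return before, after, after > capacity_per_path


def exceeds_outage_budget(reconvergence_minutes, annual_budget_minutes, used_minutes=0.0):
    """True if this re-convergence would push the yearly outage past the SLA budget."""
    return used_minutes + reconvergence_minutes > annual_budget_minutes


# Hypothetical figures: 300 units of traffic over 4 paths, each able to carry 90 units.
before, after, overloaded = per_path_load_after_failure(300.0, 4, 90.0)

# The example from the text: a 10-minute re-convergence against a 5-minute annual budget.
violated = exceeds_outage_budget(10, 5)
```

With these figures each path carries 75 units before the failure and 100 units afterwards, which exceeds the assumed capacity of 90 units per path.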
Some network elements attempt to lessen these problems by allowing an operator to provision a failover resource to take over as a backup when the main resource becomes unavailable. In that case, no communication is required between network elements when a failure occurs: network operation merely switches from the failed resource to the provisioned backup resource. Manual provisioning of failover resources is undesirable in a typical network, however, because thousands of network element resources are present and it is burdensome to manually configure and/or reconfigure all these resources. In addition, providing a failover resource for each network element doubles the resource requirements (and cost) of the network, or else configured failover resources are re-used, which increases the likelihood that failover resources will already be in use when they are needed.
A typical method for reducing the number of resources required for failover protection is to provide one failover resource for each group of n resources, where n is the number of network element resources in a group to be served by one failover resource. This tremendously improves the resource utilization of the network, but has led to other problems. More specifically, when 1) multiple network elements fail simultaneously, 2) operation of two or more network elements is suspended while performing maintenance actions on the network elements, or 3) a network element fails while another network element is down for maintenance, then multiple requests for the same failover resource can occur (e.g., two or more resources of the “n” group are out of service at the same time and traffic from both is switched to the same failover device). For instance, if two operators independently perform maintenance on two different resources in a network, it is possible that the re-routing generated by the out-of-service resources will fail over to the same alternate resource at some point in the network. Furthermore, the number of nodes (i.e., elements), the number of connections between the nodes (i.e., links), and the number of virtual paths traversing the nodes prevent a network operator from understanding the likely interactions created as a result of any particular failover. As a result, network performance is degraded, which may lead to noncompliance with the provider's SLA or, worse yet, a cascade failure scenario.
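The contention that arises when several members of the same group of n resources are out of service at once can be sketched as a simple lookup. This is a minimal illustration under stated assumptions: the mapping `backup_of`, the function name, and the resource names are hypothetical, and each resource is assumed to have at most one provisioned failover resource.

```python
from collections import defaultdict


def find_failover_contention(backup_of, out_of_service):
    """Return failover resources requested by more than one out-of-service resource.

    backup_of: dict mapping each protected resource to its (shared) failover resource.
    out_of_service: resources that are failed or down for maintenance.
    """
    requests = defaultdict(list)
    for resource in sorted(out_of_service):
        backup = backup_of.get(resource)
        if backup is not None:
            requests[backup].append(resource)
    # Contention exists wherever one failover resource has two or more requesters.
    return {backup: users for backup, users in requests.items() if len(users) > 1}


# Hypothetical 1-for-3 group (A1..A3 share F1) and a separate resource B1 backed by F2.
backup_of = {"A1": "F1", "A2": "F1", "A3": "F1", "B1": "F2"}

# A1 is down for maintenance while A3 fails; B1 is also out of service.
contention = find_failover_contention(backup_of, {"A1", "A3", "B1"})
```

Here F1 is requested by both A1 and A3, so only one of them can actually be protected, while F2 serves B1 without conflict.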
There are currently network analysis tools (such as the Conscious™ traffic engineering system available from Zvolve Systems, Inc.) that will analyze element load levels and resource allocations based upon information retrieved from element management system data. These tools will then generate suggested resource allocations for any new requests, and they can be used to predict SLA violations for resource outages by adding requests for resources currently used by the failed element to a model of the network not including that element. Unfortunately, these tools do not predict what a network element will attempt to do when a failover occurs. This results in a situation where unpredicted and/or uncoordinated actions take place in a failure situation, and where the response to a failure situation is unacceptably slow (e.g., the network tool identifies an error and re-configures the network elements based upon its observations, in a process that takes from minutes to hours).
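The kind of what-if analysis described above, in which demands are re-applied to a model of the network that omits the failed element, can be sketched generically as follows. This is not the algorithm of any particular commercial tool: the fewest-hop routing rule, the data layout, and all names and figures are assumptions made for illustration.

```python
from collections import deque


def shortest_path(adjacency, src, dst, excluded):
    """Breadth-first search for a fewest-hop path that avoids `excluded` nodes."""
    if src in excluded or dst in excluded:
        return None
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in excluded and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def what_if_outage(adjacency, link_capacity, demands, failed_node):
    """Re-route every demand around `failed_node`; report overloaded links and unroutable demands."""
    load = {}
    unroutable = []
    for src, dst, bandwidth in demands:
        path = shortest_path(adjacency, src, dst, {failed_node})
        if path is None:
            unroutable.append((src, dst))
            continue
        for a, b in zip(path, path[1:]):
            link = frozenset((a, b))
            load[link] = load.get(link, 0) + bandwidth
    overloaded = sorted(
        tuple(sorted(link)) for link, used in load.items() if used > link_capacity[link]
    )
    return overloaded, unroutable


# Hypothetical four-node mesh: A-B-D and A-C-D, every link able to carry 10 units.
adjacency = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
link_capacity = {frozenset(p): 10 for p in [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]}
demands = [("A", "D", 6), ("C", "D", 6)]

overloaded, unroutable = what_if_outage(adjacency, link_capacity, demands, failed_node="B")
```

With node B removed, both demands must cross link C-D, which then carries 12 units against a capacity of 10. A model of this kind flags the overload, but, as the text notes, it does not predict what the network elements themselves will do when the failover occurs.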