This invention relates to the field of network analysis and network management, and in particular to a method and system for assessing and ranking the effects of failures within a network based on multiple measures of system performance.
With the increased demands for information access, network reliability has become a paramount consideration, and a variety of schemes have been developed to assure at least some degree of communication among nodes of a network in the event of failures within the network. Rarely will the failure of a single device on a modern network cause the network to ‘fail’, per se.
The increased robustness of networks introduces new demands for effective network management. A primary goal for effective network management is to assure virtually continuous operation of the network despite equipment failures. To achieve this goal, the dependency of the network on any particular device should be minimized. However, once the network's basic operation is assured regardless of a failure on any particular device, the assessment of the significance of each device's proper operation on the overall performance of the network becomes ambiguous. That is, if a particular device can cause the network to fail, it is easy to identify this device as a critical device, and measures can be taken to provide alternative paths on the network to eliminate this critical dependency. After all such critical dependencies are eliminated, however, it is difficult to determine where additional safeguards should be provided to minimize the effects of any particular fault on the network's performance.
A variety of criteria are commonly used to assess the effects of a device failure on the overall performance of the network. For example, in some environments, the overall decrease in network bandwidth resulting from a device failure may be considered a viable indicator of the significance of the device to network performance. In other environments, the number of users affected by the failure may be considered a viable indicator; in yet others, the indicators may include the number of switched paths affected by the failure, the number of virtual networks affected by the failure, the number of saturated links caused by the failure, and so on. In general, however, a true assessment of a device's significance in a network includes a combination of such criteria, at which point comparing the significance of one device with that of another becomes difficult. For example, if one device's failure affects bandwidth more significantly than another device's failure, but this other device's failure affects more switched paths, it is difficult to assess which of these devices is of higher priority for implementing additional safeguards.
Generally, a failure condition affects many aspects of system performance, and different failure conditions will affect different aspects of system performance in different degrees. Because each aspect of system performance is generally measured differently, it is difficult to quantitatively compare the effects of a failure condition on the different aspects of system performance. For example, is a 20% loss in bandwidth ‘better’ or ‘worse’ than a loss of service to 2% of the clients? Or, is this loss of service to 2% of the clients ‘better’ or ‘worse’ than a loss of one Label Switched Path (LSP)? Is the loss of one LSP ‘better’ or ‘worse’ than the loss of two links? And so on.
Further compounding the difficulty in comparing the relative significance of device failures on network performance is the ‘non-linearity’ that typically exists between the measures of performance and the significance of a change in that measure. For example, a ten percent decrease in bandwidth may be considered a ‘minor’ problem, while a twenty percent decrease may be considered ‘major’, and a fifty percent decrease may be considered unacceptable. In like manner, if one failure affects “N” users, while another failure affects “2*N” users, the significance of the second failure may not be twice the significance of the first failure. This other 2*N-user failure may, in fact, have the same significance as the N-user failure in some environments, while in other environments, it may have more than twice the significance.
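The non-linearity described above can be illustrated with a minimal sketch that maps a raw measure onto a discrete severity level using a set of thresholds. The bandwidth thresholds of 10, 20, and 50 percent come from the example in the preceding paragraph; the level names, the user-count thresholds, the function name `severity_level`, and the choice to treat a value equal to a threshold as reaching that level are assumptions made only for illustration.

```python
from bisect import bisect_right

# Thresholds from the example in the text: percent decrease in bandwidth.
BANDWIDTH_LOSS_THRESHOLDS = [10.0, 20.0, 50.0]

# Hypothetical level names, indexed by the number of thresholds reached.
LEVEL_NAMES = ["none", "minor", "major", "unacceptable"]

# Hypothetical thresholds for the number of users affected (not from the text).
USERS_AFFECTED_THRESHOLDS = [100, 1000]


def severity_level(value, thresholds):
    """Return how many of the ascending thresholds `value` meets or exceeds.

    Assumption: a value exactly equal to a threshold counts as reaching it.
    """
    return bisect_right(thresholds, value)


if __name__ == "__main__":
    for loss in (5.0, 10.0, 35.0, 50.0):
        level = severity_level(loss, BANDWIDTH_LOSS_THRESHOLDS)
        print(f"{loss}% bandwidth loss -> {LEVEL_NAMES[level]}")
```

Under the hypothetical user thresholds, a failure affecting N = 150 users and one affecting 2*N = 300 users both map to level 1, whereas N = 600 and 2*N = 1200 map to levels 1 and 2 respectively, so doubling the raw measure does not by itself double the assessed significance.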
It would be advantageous to provide a comparative measure for assessing the significance of a failure on multiple aspects of the performance of a network. It would also be advantageous for this comparative measure to reflect the relative degree of significance of each aspect, regardless of the characteristics of the particular measures used to quantify each aspect.
These advantages, and others, can be realized by a method and system that quantifies “network survivability” in such a way that failure cases can be compared and ranked against each other in terms of the severity of their impact on the performance of the network. A rank ordering system is provided to quantify the degradation in network performance caused by each failure, based on user-defined sets of thresholds of performance degradation. Each failure is simulated using a model of the network, and a degradation vector is determined for each simulated failure. To provide for an ordered comparison of degradation vectors, a degradation distribution vector is determined for each failure, based on the number of times each degradation threshold level is exceeded in each performance category. A comparison function is defined to map the degradation vectors into an ordered set, and this ordered set is used to create an ordered list of network failures, in order of the network degradation caused by each failure.
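One plausible reading of the ranking scheme summarized above can be sketched as follows; it is an illustration under stated assumptions, not the claimed implementation. The sketch assumes that the degradation vector records, per performance category, the highest threshold level reached; that the degradation distribution vector counts how many categories sit at each level; and that the comparison function is a lexicographic comparison starting from the most severe level. All category names, threshold values other than the bandwidth example, failure names, and measure values are hypothetical, and the measure values merely stand in for the output of a network simulation.

```python
from bisect import bisect_right

# Hypothetical user-defined thresholds per performance category (ascending).
THRESHOLDS = {
    "bandwidth_loss_pct": [10.0, 20.0, 50.0],
    "clients_lost_pct": [1.0, 5.0, 20.0],
    "lsps_lost": [1, 3, 10],
}

# Made-up stand-ins for the measures a simulation of each failure would yield.
SIMULATED = {
    "router_A": {"bandwidth_loss_pct": 20.0, "clients_lost_pct": 2.0, "lsps_lost": 0},
    "link_B": {"bandwidth_loss_pct": 5.0, "clients_lost_pct": 0.5, "lsps_lost": 12},
    "switch_C": {"bandwidth_loss_pct": 12.0, "clients_lost_pct": 2.0, "lsps_lost": 1},
}


def degradation_vector(measures, thresholds):
    """Per category, the number of thresholds the measure meets or exceeds."""
    return {cat: bisect_right(thresholds[cat], measures[cat]) for cat in thresholds}


def distribution_vector(deg_vec, num_levels):
    """Count the categories at each level, ordered most severe level first."""
    counts = [0] * num_levels
    for level in deg_vec.values():
        if level > 0:
            counts[level - 1] += 1
    return tuple(reversed(counts))


def rank_failures(simulated, thresholds):
    """Return failure names ordered from most to least degrading."""
    num_levels = max(len(t) for t in thresholds.values())
    keys = {
        name: distribution_vector(degradation_vector(m, thresholds), num_levels)
        for name, m in simulated.items()
    }
    # Tuples compare lexicographically, so the count at the most severe
    # level dominates, then the next most severe level, and so on.
    return sorted(keys, key=keys.get, reverse=True)


if __name__ == "__main__":
    for name in rank_failures(SIMULATED, THRESHOLDS):
        print(name)
```

With these made-up values, link_B ranks first because it is the only failure that reaches the most severe level in any category, even though switch_C degrades more categories at the lowest level. A lexicographic comparison was chosen here because it needs no weighting between categories that are measured in different units; other comparison functions are possible.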
Throughout the drawings, the same reference numerals indicate similar or corresponding features or functions. The drawings are included for illustrative purposes and are not intended to limit the scope of the invention.