Customers communicating over a voice over internet protocol (VoIP) network are severely impacted when an entire network, or even a portion of a network, is rendered inoperable by an unexpected failure of network hardware or software. When a portion of a network is down, the quality of service for a customer is degraded, and if a critical component of a network is down, basic services offered by VoIP carriers, such as phone calling, can be suspended.
The problems associated with network failure due to software or hardware malfunctions are further exacerbated by extended periods of downtime spent identifying the failure. If the malfunctioning components of a network are not readily identifiable, the time taken to repair a problem can be extended by troubleshooting and testing components simply to find the failure. This increases the costs not only to the customer, but also to the network provider.
Systems that have been developed to manage the operations of networks, such as operations support systems, are capable of detecting network failures, but they are not capable of detecting potential (latent) or partial failures before those failures become total failures. For example, a software error on a particular network node may significantly slow down a connection, but may not be detected as a problem by typical operations support systems until the software problem develops into a total failure. Likewise, a latent hardware issue such as a faulty wire connection may impede system performance and may remain undetected by operations support systems until the connection totally fails.
Operations support systems are also not capable of isolating the location of, or the reasons for, the specific network elements or connections that are causing a failure. For example, a particular network node may fail, but because that node has been rendered inoperable, the failed node will not be able to identify itself as a problem. Without a method for readily identifying the failing node or nodes, the problem may only be isolated through time-consuming troubleshooting. If multiple nodes fail simultaneously, a particular problem with one node may not even be detected until problems with other nodes have been isolated and corrected, because visibility into a problem may be masked by other unrelated problems. Thus, there is a need for a method to automatically detect and isolate total, partial, or latent failures of network elements or connections.