One of the primary responsibilities of service providers is to ensure that their services provide a level of performance and robustness that satisfies the commitments specified in their Service Level Agreements (SLAs) with customers.
An existing approach is to monitor the quality and behaviour of the services by measuring system-internal performance characteristics (such as round trip delay, available bandwidth and loss ratio), to identify unusual or anomalous activity that either directly indicates or indirectly implies that the service is no longer behaving satisfactorily, and to identify the (root) causes of service performance degradations.
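By way of illustration only, the monitoring step of such an approach might be sketched as follows. The threshold values, the `Measurement` record and the function names are hypothetical and are not taken from any particular system; a real deployment would derive its limits from the relevant SLA.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Measurement:
    """One sample of system-internal performance for a resource (hypothetical record)."""
    resource: str
    round_trip_delay_ms: float
    available_bandwidth_mbps: float
    loss_ratio: float


# Assumed limits, chosen only for this example.
MAX_DELAY_MS = 100.0
MIN_BANDWIDTH_MBPS = 10.0
MAX_LOSS_RATIO = 0.01


def find_violations(m: Measurement) -> List[str]:
    """Return the names of the characteristics that breach the assumed limits."""
    violations = []
    if m.round_trip_delay_ms > MAX_DELAY_MS:
        violations.append("round_trip_delay")
    if m.available_bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        violations.append("available_bandwidth")
    if m.loss_ratio > MAX_LOSS_RATIO:
        violations.append("loss_ratio")
    return violations


def degraded_resources(measurements: List[Measurement]) -> Dict[str, List[str]]:
    """Map each resource with at least one violation to its violated characteristics."""
    result = {}
    for m in measurements:
        violations = find_violations(m)
        if violations:
            result[m.resource] = violations
    return result


samples = [
    Measurement("router-a", 20.0, 50.0, 0.001),
    Measurement("link-b", 250.0, 5.0, 0.0),
]
print(degraded_resources(samples))
```

Such a check only flags resources whose measurements look anomalous; it does not by itself establish which resource is the root cause of a degradation.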
It has been proposed to provide measurements and metadata from the network resources, and to utilise the metadata to enable a system to automatically generate the topology of, and the dependencies between, the resources to assist in identifying root causes.
In addition to the measurements and topology, logic and decision criteria are needed to find the cause of service degradation automatically. Today's solutions do not apply a generic methodology that solves this problem for all system services. Instead, specific solutions are developed for a specific system service where the cause of the problem is known in advance, and the root cause analysis system may arrive at a result using decision trees or similar methodologies.
Existing solutions based on decision trees rely on probabilities or priorities to decide which branch the reasoning should follow. The use of decision trees has many disadvantages. For example, a training or “learning” period is needed before priorities or probabilities can be determined. Furthermore, the priorities or probabilities that have been assigned to a decision may not be accurate. A decision tree may also miss causes of service degradation, because an earlier probability-based decision about which branch to follow can exclude a cause that is present in a branch which has not been followed. Further, a decision tree may not reflect real-time network status when calculating probabilities.
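The way in which a probability-driven traversal can miss a cause may be illustrated by the following minimal sketch. The tree, the probabilities, the resource names and the `faulty` flag (a ground truth used only to make the example checkable) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    probability: float = 1.0   # assumed likelihood that the cause lies in this branch
    faulty: bool = False       # ground truth, used only for this illustration
    children: List["Node"] = field(default_factory=list)


def greedy_diagnose(node: Node) -> Optional[str]:
    """Follow only the most probable child at each level, as a decision tree would."""
    while node.children:
        node = max(node.children, key=lambda c: c.probability)
    return node.name if node.faulty else None


def exhaustive_diagnose(node: Node) -> List[str]:
    """Visit every branch and collect all faulty leaves, for comparison."""
    if not node.children:
        return [node.name] if node.faulty else []
    found = []
    for child in node.children:
        found.extend(exhaustive_diagnose(child))
    return found


tree = Node("service", children=[
    Node("access", probability=0.7, children=[
        Node("access-link", probability=0.6),
        Node("access-node", probability=0.4),
    ]),
    Node("core", probability=0.3, children=[
        Node("core-router", probability=1.0, faulty=True),
    ]),
])

print(greedy_diagnose(tree))      # None: the less probable "core" branch is never visited
print(exhaustive_diagnose(tree))  # ['core-router']
```

In this sketch the actual cause sits in the branch with the lower assigned probability, so the greedy traversal ends in the other branch and reports nothing.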
Existing solutions provide complex systems and algorithms for analysis of alarms and measurements to find specific service problems. The actual automation for finding the offending resource is often limited to specific problems and only works after the service has been delivered. With these solutions it takes time to find the root cause of service degradation, which makes them limited and impractical to implement for real-time analysis.
Existing solutions for network-wide measurement and performance estimation pay little attention to requirements for compatibility or inter-operability. These systems are usually point solutions, use different performance metrics, employ various underlying measurement mechanisms, and often operate off-line only. Though diverse in underlying mechanisms, these systems have the common goal of providing system-internal characteristics to applications, and their measurements overlap significantly.
Furthermore, existing solutions require complex systems and domain-specific knowledge in order to correlate information from different resources and thereby find the root cause.
Existing solutions also rely on unstructured network measurements and thus try to make the best of whatever measurements are available. The lack of metadata and correlation information in a network node that provides measurements makes it very hard to correlate measurements from different resources, especially at a session level.
Due to the lack of inter-operability between data sources, existing Root Cause Analysis (RCA) solutions require extensive human intervention; automation in existing RCA solutions is of very limited use in today's telecommunication networks with heterogeneous resources and services.
Probe-based solutions provide measurements but lack the metadata and addressing information from the resources involved in the service delivery that would be needed to enable a large degree of automation.