In networks providing communications services, one of the primary responsibilities of service providers is to ensure that their services provide a level of performance and robustness that satisfies the commitments specified in their service level agreements (SLAs) with customers, while at the same time maintaining efficient use of resources. A well-known approach is to monitor the quality and behaviour of the services, identify unusual or anomalous activity that either directly indicates or indirectly implies that a service is no longer behaving satisfactorily, and analyse the root causes of service performance degradations.
In general, there are two types of causes for service problems: hard failures and soft failures. A hard failure is a failure event such as a link breakage or a node going down, which is readily observable through network alarms or other notifications in Operation And Maintenance (O&M) systems. Compared to a hard failure, a soft failure is a less noticeable but recurring network condition. For example, the term soft failure refers to service degradations which are short in duration and are caused by performance-impairing events that occur intermittently over an extended period of time. Such problems may have disappeared before a network operator can react to them, yet they may keep recurring and cause repeated degradation of user services. In some cases, such conditions develop slowly and accumulate over time before they eventually turn into a serious hard failure. For example, repeated wireless link flaps may be observed over time before the link fails completely. Even if the problem does not result in a hard failure, the accumulated performance degradation can have a significant impact on user services and system services and affect user satisfaction with service quality. Discovering the underlying root causes of such soft failures is therefore at least as critical as discovering those of hard failures, since the conditions cannot be permanently eliminated from the network until their causes are known. It is essential to troubleshoot and repair such network conditions in a timely fashion in order to ensure high reliability and performance in large mobile networks.
Existing O&M solutions, especially alarm-based Fault Management and Performance Management systems, are designed to diagnose hard failures. In contrast, the solution to be disclosed is designed to detect and localize service problems caused by soft failures in mobile wireless infrastructures.
Diagnosis of such soft-failure service problems in wireless infrastructures, such as 3G/LTE networks with thousands of access nodes where normal service behaviour is highly dynamic, presents serious challenges.
(1) Changing network topology. User Equipments (UEs) may frequently change their point of attachment and, accordingly, their network or service paths. This makes it difficult to build causal relations (or inference graphs) between service problems and network events in order to pinpoint the causes of performance problems. Existing network fault diagnosis systems, whether passive or active (i.e. probing), implicitly assume that the fundamental structure of the network is either static or changes slowly. This assumption is what allows these systems to build inference graphs. Because the assumption does not hold when UEs move, these approaches cannot be applied to mobile networks.
(2) Absence of full-scale continuous probing or monitoring points. As discussed earlier, soft failures are unpredictable and therefore require full-scale measurements. However, in a network primarily consisting of roaming users, a full-scale, network-wide probing or monitoring infrastructure is impractical and expensive to deploy.
(3) Difficulties in reproducing the problematic conditions. Probing-based solutions may fail to detect a problem if the conditions that caused it cannot be reproduced.
Service performance monitoring is one known diagnosis solution. It uses terminal reports, or passively collected packet traces, to calculate service performance metrics (e.g. round-trip times). Terminal reports refer to Quality of Service (QoS) reports (e.g. RTCP reports for RTP streams, covering throughput, packet loss, latency and jitter) and Quality of Experience (QoE) reports (e.g. 3GPP TS 26.234; based on HTTP or RTSP). Packet traces can be used to analyse service-level quality. Service quality alarms can be obtained from service quality monitors deployed within the service provider network. The monitors gather statistics such as packet loss, packet delay and service outage durations. These statistics are then used as quality indicators for services.
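As an illustration of how such quality indicators might be derived from a packet trace, the following minimal sketch computes a loss ratio, a mean delay and a simple jitter figure. The record layout (`sent_at`, `received_at`) and the function name `quality_indicators` are assumptions made for this example only; they are not taken from any monitoring product or standard. The jitter value here is a plain mean of absolute differences between consecutive delays, not the smoothed estimator used in RTCP reports.

```python
from statistics import mean


def quality_indicators(packets):
    """Compute simple quality indicators from a packet trace.

    `packets` is a list of dicts with hypothetical keys:
      'sent_at'     -- send timestamp in seconds
      'received_at' -- receive timestamp in seconds, or None if the packet was lost
    """
    # One-way delays of the packets that actually arrived, in trace order.
    delays = [p["received_at"] - p["sent_at"]
              for p in packets if p["received_at"] is not None]
    lost = len(packets) - len(delays)

    # Simplified jitter: mean absolute difference between consecutive delays.
    if len(delays) > 1:
        jitter = mean(abs(b - a) for a, b in zip(delays, delays[1:]))
    else:
        jitter = 0.0

    return {
        "packet_loss_ratio": lost / len(packets) if packets else 0.0,
        "mean_delay": mean(delays) if delays else None,
        "mean_jitter": jitter,
    }


# Illustrative trace: four packets, the third one lost.
sample_trace = [
    {"sent_at": 0.0, "received_at": 0.125},
    {"sent_at": 1.0, "received_at": 1.25},
    {"sent_at": 2.0, "received_at": None},
    {"sent_at": 3.0, "received_at": 3.125},
]
indicators = quality_indicators(sample_trace)
```

Indicators of this kind describe how a service is performing but carry no information about which network element or event caused a degradation.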
Using network events and/or counters is another known solution. Events and/or counters are defined and configured for network events or other data sources (such as terminal types) and are used to calculate statistics. The data analysis is normally based on simple statistical methodologies, typically highest-failure-rate style analysis using Business Intelligence tools. For example, this approach can calculate the attach success ratio or the Packet Data Protocol (PDP) context activation success ratio for a particular mobile device type based on subscriber session analysis. This may enable the operator to identify service problems caused by, for example, the design of particular mobile devices.
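The per-device-type success ratio described above can be sketched as follows. This is only an illustration of the counter-based analysis, under the assumption that each event is available as a `(device_type, procedure, succeeded)` tuple; the tuple layout, the function names and the sample device names are hypothetical.

```python
from collections import defaultdict


def success_ratio_by_device(events):
    """Return {(device_type, procedure): success ratio}.

    `events` is an iterable of hypothetical (device_type, procedure, succeeded)
    tuples, e.g. ("phone_a", "pdp_activation", True).
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for device_type, procedure, succeeded in events:
        key = (device_type, procedure)
        attempts[key] += 1
        if succeeded:
            successes[key] += 1
    return {key: successes[key] / attempts[key] for key in attempts}


def worst_first(ratios):
    """Order entries by ascending success ratio (highest failure rate first)."""
    return sorted(ratios.items(), key=lambda item: item[1])


# Illustrative events only; not real measurement data.
sample_events = [
    ("phone_a", "pdp_activation", True),
    ("phone_a", "pdp_activation", True),
    ("phone_a", "pdp_activation", False),
    ("phone_a", "pdp_activation", True),
    ("phone_b", "pdp_activation", False),
    ("phone_b", "pdp_activation", True),
    ("phone_b", "attach", True),
]
ratios = success_ratio_by_device(sample_events)
ranked = worst_first(ratios)
```

Ranking by lowest success ratio corresponds to the highest-failure-rate style of analysis: it points to the device type with the most failures per attempt, but says nothing about which service or application was affected.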
However, neither service performance monitoring nor the use of network events or counters is capable of accurately identifying the root causes of soft-failure service problems. Service performance monitoring may disclose only the symptoms of service problems and fails to determine their causes. Network counters and events, on the other hand, are usually unaware of services or applications. In the best case, failure events can only be associated with PDP sessions.