Next generation services in telecommunication systems are expected to be executed on a so-called telecom cloud, which combines the flexibility of today's computing clouds with the service quality of telecommunication systems. Real-time service assurance will become an integral part in transforming the general and flexible cloud into a robust and highly reliable cloud that can ensure low latency and agreed service quality to its customers. The agreed service quality is typically specified in a Service Level Agreement (SLA). Therefore, a service assurance system for telecommunication services must be able to detect, and preferably also predict, problems that may cause SLA violations. This is a complex task already in legacy systems and will become even more challenging when executing the services in the telecom cloud. Furthermore, the service assurance system must be able to diagnose the detected problem. Finally, the service assurance system must be able to remedy, in real time, the problem once it has been detected.
One promising approach to realizing the service assurance system is based on machine learning. In such a service assurance system, the service quality and behavior are learned from observations of the system. An ambition is to make real-time predictions of the service quality and, in case of service quality degradation and/or SLA violations, to perform a root cause analysis, also known as a root cause inference. Thanks to the root cause analysis, actions to mitigate the service degradation may be taken. The actions should preferably remedy the detected faults and restore SLA fulfillment as soon as possible to minimize the impact of potential penalties due to the SLA violation(s), which in turn result from the service degradation.
In existing service assurance systems, machine learning has been used to build prediction models for service quality assurance. For example, machine learning has been used to predict user application quality-of-service (QoS) parameters, quality-of-experience (QoE), and anomalies in complex telecommunication environments. With machine learning, sample data is used for training a statistical model, which can later be used to provide generalized predictions for unseen data. In this way, a mechanism for detection of fault(s), sometimes even before the fault(s) have occurred, may be realized.
Predicting the SLA violations, e.g. by means of machine learning, is one tool for enabling delivery of high-quality services. Based on predictive analysis, a provider of a telecommunication system can take more timely actions than is possible with traditional customer support services. However, prediction becomes even more useful when coupled with analytics-based, automated, real-time fault localization to diagnose the detected faults. Without such automation, the diagnosis would require significantly more effort, including human intervention, and thus more money and time.
As will be described in more detail below, existing solutions have limitations, e.g. in terms of accuracy and computation requirements, in the way they attempt to do fault localization.
Some existing solutions are quite simplistic, and thus less accurate. As an example, a simplistic solution may only apply static threshold mechanisms to collected performance metrics data in order to detect fault(s), and then rely on a manual root cause analysis to localize the fault(s).
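For illustration only, such a static threshold mechanism may be sketched as follows. The metric names, the threshold values and the function name are hypothetical and are not taken from any particular existing solution.

```python
from typing import Dict, List

# Hypothetical static limits per performance metric (illustrative values only).
STATIC_THRESHOLDS: Dict[str, float] = {
    "cpu_utilization_pct": 90.0,
    "memory_utilization_pct": 85.0,
    "response_time_ms": 200.0,
}


def detect_threshold_faults(
    sample: Dict[str, float],
    thresholds: Dict[str, float] = STATIC_THRESHOLDS,
) -> List[str]:
    """Return the names of the metrics whose value exceeds its static limit."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in sample and sample[name] > limit
    )


if __name__ == "__main__":
    observed = {
        "cpu_utilization_pct": 95.0,
        "memory_utilization_pct": 40.0,
        "response_time_ms": 250.0,
    }
    print(detect_threshold_faults(observed))
```

Since the limits are fixed, such a check only flags that a metric is out of range. It neither adapts to varying load nor indicates where the fault originates, which is why the localization is left to a manual analysis.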
Then there are other methods which apply less precise and more computation-intensive clustering techniques to the system metrics data to detect faults and then do automated fault localization. For example, in “CloudPD: Problem determination and diagnosis in shared dynamic clouds” to Sharma, Bikash, et al., Dependable Systems and Networks (DSN), published at the 43rd Annual Institute of Electrical and Electronics Engineers (IEEE)/International Federation for Information Processing (IFIP) International Conference in 2013, a solution called CloudPD employs a fault detection mechanism which includes localization of faults based on statistical correlation techniques. It uses pairwise correlation computation for metrics within a Virtual Machine (VM) as well as across VMs. This is done for the current interval and for the last known good interval from the recent history. If the deviation in correlation value between these two time intervals is bigger than a defined threshold for any metric, then that metric is added to a list of culprit metrics, and this culprit list is used to build fault signatures for the fault classification. The time intervals are usually about 15 min.
A problem with this solution may be that the time intervals are too long for some applications, for example during quickly varying load conditions in the network. This may lead to late, or even absent, detection and/or localization of faults, which also delays, or fails to provide, the possibility to take actions to remedy the faults.
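The correlation-deviation check described above may be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation of the cited solution: Pearson correlation is assumed as the correlation measure, both metrics of a deviating pair are assumed to be added to the culprit list, and the function names and the threshold value of 0.5 are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long series (0.0 if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def culprit_metrics(
    current: Dict[str, Sequence[float]],
    last_good: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # hypothetical deviation threshold
) -> List[str]:
    """Compare pairwise correlations of the current interval with those of the
    last known good interval and collect the metrics of every deviating pair."""
    culprits = set()
    for m1, m2 in combinations(sorted(current), 2):
        now = pearson(current[m1], current[m2])
        before = pearson(last_good[m1], last_good[m2])
        if abs(now - before) > threshold:
            # Assumption: both metrics of the deviating pair become culprits.
            culprits.update((m1, m2))
    return sorted(culprits)
```

The sketch makes the dependence on the interval length visible: each interval must contain enough samples for the pairwise correlations to be meaningful, and a deviation can only be detected once a complete interval has been collected.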
In “An ensemble MIC-based approach for performance diagnosis in big data platform” to Chen, Pengfei, Yong Qi, Xinyi Li, and Li Su, published in Big Data, 2013 IEEE International Conference on, pp. 78-85, IEEE, 2013, a solution for the CloudPD is proposed for performance diagnosis. During fault localization in this solution, the Maximal Information Coefficient (MIC) is used for establishing correlations between the variables, which allows capturing non-linear relationships. The correlations are computed only between the performance metrics locally (1-way), i.e. within one local server machine on a specific data platform, e.g. Hadoop.
There also exist monolithic solutions which combine the fault detection and localization phases into one big operation, which hinders scalability of the solution.