The increasing complexity of the architecture of cellular networks has made network management more complicated. Self-Organizing Networks are those networks whose management is carried out with a high level of automation. In the field of Self-Healing or automatic troubleshooting of Self-Organizing Networks, an enormous diversity of performance indicators, counters, configuration parameters and alarms has led operators to search for intelligent and automatic techniques that cope with faults in a more efficient manner, making the network more reliable.
A purpose of Self-Healing is to solve or mitigate faults which can be solved or mitigated automatically by triggering appropriate recovery actions. One of the main barriers to Self-Healing research is the difficulty of knowing the effects of any given fault, which is the fundamental basis for building an effective diagnosis system.
Fault diagnosis in Self-Organizing Networks, which may also be called root-cause analysis, is a key function in fault management that allows the identification of the fault causes. In this sense, some efforts have been devoted to the development of usable automatic diagnosis systems that improve the reliability of the network.
“An Automatic Detection and Diagnosis Framework for Mobile Communication Systems”, P. Szilagyi and S. Novaczki, IEEE Transactions on Network and Service Management, 9(2), 184-197, 2012, discloses an integrated detection and diagnosis framework that identifies anomalies and finds the most probable root cause.
Improvements to this framework are covered in “An Improved Anomaly Detection and Diagnosis Framework for Mobile Network Operators”, S. Nováczki, In Proc. of 9th International Conference on the Design of Reliable Communication Networks (DRCN), 2013, where more sophisticated profiling and detection capabilities have been included.
WO2014040633A1 discloses a method for determining faults through pattern clustering.
“System and method for root cause analysis of mobile network performance problems”, J. Cao, L. Erran Li, T. Bu and S. Wu Sanders, WO 2013148785 A1, October 2013 discloses a method for identifying the causes of changes in performance indicators by analyzing the correlation with a plurality of counters.
Fault diagnosis in cellular networks has also been approached by applying different mathematical techniques, such as in “Automated diagnosis for UMTS networks using Bayesian network approach”, R. M. Khanafer, B. Solana, J. Triola, R. Barco, L. Moltsen, Z. Altman and P. Lázaro, IEEE Transactions on Vehicular Technology, 57(4), 2451-2461, 2008, and in “Advanced analysis methods for 3G cellular networks”, J. Laiho, K. Raivio, P. Lehtimäki, K. Hätönen, and O. Simula, IEEE Transactions on Wireless Communications, 4(3), 930-942, 2005.
Many existing Self-Healing solutions rely on fairly primitive approaches to fault diagnosis, while the more complex alternatives require a lot of information about the faults that is not available in most cases, e.g. the conditional probability density function of metrics (symptoms) for given fault causes. Due to these shortcomings, automatic diagnosis systems have remained confined to the scientific literature and have not been deployed in live networks.
In known systems, fault diagnosis may be based on how consistently a metric is associated with a fault. In particular, the association of a metric with a fault is considered to be fully consistent if the metric is always present in (or always missing from) a metric report relating to the fault. The metric report may include all metrics that deviate from their usual behavior at the time of a fault. A drawback of this approach is that, in many cases, the effect on the metrics is not a clear deviation from a normal range, but may be a small change (e.g. a peak or a step) in the temporal evolution of the metric that would be disregarded. The more information that is available, the better a fault diagnosis system performs. For example, in the case of a cell outage, any impact on neighboring cells, even if small (e.g. a slight increase in traffic), can be used in fault diagnosis.
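As an illustration of the consistency notion described above, the following is a minimal sketch and not the method of any cited system. It assumes each metric report is a set of metric names that deviated at the time of a fault. The scoring formula `abs(2p - 1)`, the function names and the example reports are all hypothetical, chosen only so that "always present" and "always missing" both score 1.0.

```python
from typing import List, Set


def presence_rate(reports: List[Set[str]], metric: str) -> float:
    """Fraction of fault reports that list the metric as deviating."""
    if not reports:
        raise ValueError("at least one report is required")
    return sum(metric in report for report in reports) / len(reports)


def consistency(reports: List[Set[str]], metric: str) -> float:
    """Hypothetical consistency score (an assumption, not from the source).

    Returns 1.0 when the metric is always present in, or always missing
    from, the reports for a fault, and 0.0 when it appears in exactly half.
    """
    p = presence_rate(reports, metric)
    return abs(2.0 * p - 1.0)


# Hypothetical metric reports collected for four cell-outage faults.
outage_reports = [
    {"dropped_calls", "neighbor_traffic"},
    {"dropped_calls"},
    {"dropped_calls", "neighbor_traffic"},
    {"dropped_calls"},
]

for name in ("dropped_calls", "neighbor_traffic", "handover_failures"):
    print(name, consistency(outage_reports, name))
```

In this sketch, a metric such as the neighbor traffic, which only sometimes deviates enough to enter a report, receives a low score even though its small changes may still carry diagnostic information.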
In known systems, this issue is even more problematic due to the use of thresholds. For example, an anomaly class may collect metrics having similar effects, and each class is then characterized by an anomaly class indicator that is activated when the corresponding metrics violate predefined thresholds. Since anomalous behavior in the metrics is recognized only when they reach abnormal values, the information related to smaller variations and specific degraded patterns is ignored. In addition, the use of thresholds in this case makes the decision on whether a metric is degraded or not more drastic.
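The loss of information caused by thresholds can be sketched as follows. This is an illustrative example under stated assumptions: the indicator is assumed to activate when any metric in the class exceeds an upper threshold, and the metric names, sample values and thresholds are hypothetical. The step measure (difference of means before and after a given sample index) is likewise only one possible way to quantify a small change.

```python
from statistics import mean
from typing import Dict, Sequence


def anomaly_class_indicator(
    metrics: Dict[str, Sequence[float]], thresholds: Dict[str, float]
) -> bool:
    """Assumed rule: active if any metric in the class exceeds its threshold."""
    return any(
        max(samples) > thresholds[name] for name, samples in metrics.items()
    )


def step_size(samples: Sequence[float], change_index: int) -> float:
    """Mean after change_index minus mean before it (illustrative measure)."""
    return mean(samples[change_index:]) - mean(samples[:change_index])


# Hypothetical anomaly class observed in a cell neighboring an outage.
load_class = {
    "neighbor_traffic": [40.0, 41.0, 39.0, 40.0, 46.0, 47.0, 45.0, 46.0],
    "cpu_load": [30.0, 31.0, 30.0, 29.0, 30.0, 31.0, 30.0, 30.0],
}
load_thresholds = {"neighbor_traffic": 60.0, "cpu_load": 80.0}

indicator = anomaly_class_indicator(load_class, load_thresholds)
step = step_size(load_class["neighbor_traffic"], 4)
print("indicator active:", indicator)
print("traffic step:", step)
```

Here the indicator stays inactive because no sample crosses its threshold, yet the traffic series contains a clear step that a threshold-only scheme never reports.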
In other systems, diagnosis is carried out by means of classification/regression trees, which are used to predict membership of event counters in one or more classes of performance metrics of interest. However, this kind of solution is typically based on fixed thresholds, so it suffers from drawbacks similar to those described above.
The application of Bayesian Networks has an important limitation. In particular, models must contain all the possible states of the network and their associated probabilities. The construction of such a model is a complex task in which knowledge acquisition becomes extremely challenging and is normally not feasible given the limited time available to troubleshooting experts.
Other systems use Self-Organizing Maps. For example, proposed methods based on this technique facilitate the diagnosis when the cause of the problem is unknown. Since a large number of labeled cases (i.e. identified faults associated with their symptoms) is hard to obtain from recently deployed networks, this is a reasonable starting point for the diagnosis. However, there can be deviations in metrics (e.g. due to traffic variations) that do not indicate a problem but that the Self-Organizing Maps would nevertheless classify as a potential problem, causing some confusion for the troubleshooting expert.