The present invention relates to key performance indicator selection and, more specifically, to the selection of key performance indicators for anomaly detection analytics.
In order to maintain or secure mission-critical systems, businesses often rely on monitoring systems that can help predict, detect and/or diagnose problems. However, when incidents occur in a complex production environment, it takes a tremendous amount of effort to investigate and determine the root causes of those incidents based on information provided by the monitoring system in use. For instance, a subject matter expert might need to analyze data related to the metrics involved in an incident over a time period. For any given anomaly, a large amount of effort can go into building up the pattern of metrics that helps an administrator understand and address the situation at hand.
Presently, the solution is complex and requires the writing of rules or situations, such as those found in products like the IBM Tivoli Network Manager (ITNM). There, the approach is to build a situation from a series of rules (e.g., if CPU usage goes above X and memory usage drops below Y within time period Z, then raise an alarm). This type of solution involves a significant amount of manual encoding/rule writing, can be error prone and is seen as a common source of pain to users. At the same time, in many organizations, it has become common to “metricize” applications to enable monitoring, and this has led to an explosion in the number of possible metrics and the different systems from which these metrics can be monitored. It has therefore also become practically infeasible to write rules for the quantity of metrics that are available.
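By way of illustration only, the kind of hand-written rule described above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any product: the `Sample` record, the `rule_fires` function and its parameter names are hypothetical, and "within time period Z" is assumed to mean that a CPU breach and a memory breach occur no more than Z seconds apart.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring observation (hypothetical record layout)."""
    timestamp: float  # seconds since an arbitrary start
    cpu: float        # CPU usage, percent
    memory: float     # memory usage, percent


def rule_fires(samples, cpu_above, memory_below, window_seconds):
    """Return True if CPU usage exceeds cpu_above (X) and memory usage
    falls below memory_below (Y) within window_seconds (Z) of each other."""
    cpu_times = [s.timestamp for s in samples if s.cpu > cpu_above]
    mem_times = [s.timestamp for s in samples if s.memory < memory_below]
    # Compare every CPU breach with every memory breach.
    return any(
        abs(c - m) <= window_seconds
        for c in cpu_times
        for m in mem_times
    )


samples = [
    Sample(timestamp=0, cpu=50, memory=60),
    Sample(timestamp=10, cpu=95, memory=60),
    Sample(timestamp=20, cpu=60, memory=10),
]
if rule_fires(samples, cpu_above=90, memory_below=20, window_seconds=30):
    print("raise alarm")
```

Even this single rule requires three hand-chosen thresholds (X, Y and Z) for one pair of metrics, which suggests why writing such rules by hand does not scale to a large number of metrics.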
Solutions, such as smart cloud analytics predictive insights, aim to address the problem of metricized applications by applying large scale data mining techniques to automate the “rule writing” for users. Automated rule writing results in subsets of key performance indicators (KPIs), which involve metrics that are typically organized into groups, being selected at a group-level granularity at an entry point to the system and at a stage known as mediation. The KPI organization might be at the metric level (e.g., “Response Time”) or at the resource level (e.g., “Response Time on WebSphere Servers”).