System performance management in a computing system has traditionally been based on the collection of data from multiple sources, which is then processed and presented to system administrators for analysis. Depending on the complexity of the system, different levels of aggregation, threshold detection, pattern recognition, etc., are applied to the data before it is presented for analysis. Complex computing systems may generate thousands of dynamic performance metrics in the form of key performance indicators (KPIs) with time-varying values, which makes it challenging to manage the metrics manually. In this regard, automatic alerts may be used that are based on predetermined thresholds or rule sets and that indicate a malfunction when triggered.
However, configuration of the rules for such alerts, whether static or dynamic, is often difficult in that it may involve expertise in two separate disciplines. In particular, it requires a deep understanding of the relevant technology domain, generally associated with a domain expert, and also mathematical skills, generally associated with a data scientist, who provides the set of tools and/or algorithms to automate the collection, filtering, and analysis of the data. For example, a domain expert may be proficient in the relevant technology and the interrelationships between the various components of the monitored system. However, the domain expert may not be familiar with the tools and algorithms to automatically gather, filter, and analyze the vast number of KPIs generated by a complex system. Indeed, such analysis is typically the realm of the data scientist, who may not have a deep understanding of the relevant technology and the interrelationships between the various components of the system.
While tool developers may use traditional approaches to find a compromise between domain experts and data scientists to provide customized solutions for defining and/or updating system alerts, such tight coordination between the two disciplines generally does not allow a quick turn-around time and typically results in sub-optimal performance of the system.
Accordingly, it would be beneficial to have an automated and efficient way of developing intelligent alerts that are operative to diagnose existing and/or impending malfunctions in a complex system, such as a data network. It would also be beneficial to provide a method and system of creating intelligent alerts with a high confidence level that avoid false positives and do not require substantial mathematical knowledge. It is with respect to these considerations and others that the present disclosure has been written.