The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Ensuring high availability, performance, and reliability of an Internet service requires extensive telemetry data. In this regard, service providers routinely tout that they monitor hundreds of millions of metrics. Over the years, data collection has become a commodity. The challenge, however, is to analyze the data in order to mitigate the customer impact of the issue at hand. Given the deluge of metrics, a key first question in the context of Root Cause Analysis is: into which metrics should one dive deeper? Broadly speaking, the question can be broken into two parts: 1) which metrics have changed “significantly” compared to their respective histories; and 2) in which order should one analyze the metrics identified in part 1.
One can potentially argue that the above can be done visually. There are two primary reasons why this is not feasible: i) the large volume of metrics; and ii) more importantly, visual inspection is error prone. FIG. 1 illustrates these challenges. In the left column (101), seven candidate metrics are depicted, each containing a reference section (in blue) and a query section (in red). Due to compression of the y-axis, it is difficult to tell that the mean value of the highlighted metric (102) changes by nearly 20 percent between the reference and query portions. One can look at these metrics one at a time, as in the right column (103), to detect such subtle differences more easily. However, inspecting the metrics one by one makes it difficult to rank them by the strength of their relative changes and is very time consuming.
A system that automatically ranks metrics by the differences between their reference and query portions would provide invaluable guidance to operational personnel as they begin a deep-dive analysis. Such guidance helps to minimize the Time-To-Resolve, which in turn is key to end-user experience and thereby to the bottom line.
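By way of illustration only, the two-part question posed above (which metrics changed, and in which order to inspect them) can be sketched as a score-and-sort procedure. The sketch below assumes that each metric is supplied as a pair of numeric sequences (a reference portion and a query portion) and uses the relative change in mean value as a stand-in score. The function names, the metric names, and the sample values are hypothetical, and the choice of score is an assumption made for the example rather than a statement of the claimed technology.

```python
from statistics import mean


def relative_mean_change(reference, query):
    """Return |mean(query) - mean(reference)| / |mean(reference)|.

    Illustrative score only; assumed here, not taken from the disclosure.
    """
    ref_mean = mean(reference)
    qry_mean = mean(query)
    if ref_mean == 0:
        # Avoid division by zero: any departure from a zero baseline is
        # treated as an unbounded relative change.
        return float("inf") if qry_mean != 0 else 0.0
    return abs(qry_mean - ref_mean) / abs(ref_mean)


def rank_metrics(metrics):
    """Rank metrics from largest to smallest relative mean change.

    `metrics` maps a metric name to a (reference, query) pair of sequences.
    """
    scores = {
        name: relative_mean_change(reference, query)
        for name, (reference, query) in metrics.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data: one metric shifts by 20 percent of its mean,
# one shifts slightly, and one keeps the same mean.
metrics = {
    "cpu_util": ([50.0, 52.0, 48.0, 50.0], [60.0, 61.0, 59.0, 60.0]),
    "latency_ms": ([100.0, 101.0, 99.0, 100.0], [100.0, 102.0, 98.0, 100.0]),
    "error_rate": ([2.0, 2.0, 2.0, 2.0], [2.1, 2.1, 2.1, 2.1]),
}

ranking = rank_metrics(metrics)
for name, score in ranking:
    print(f"{name}: {score:.1%}")
```

In this sketch the metric whose mean shifts by 20 percent is listed first even though such a shift may be hard to see on a compressed y-axis. A score based only on the mean ignores variance and seasonality, so a practical system would likely need a more robust measure of change.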