When online services are used via networked computing environments, interactions with the online services generate data that indicate various characteristics regarding the use of these online services. For example, various electronic interactions via online services (e.g., page views, website visits, webpage reloads) automatically generate data describing these actions (e.g., numbers of page views or website visits for each day of a given time period). Analysis of this data can identify issues that impact the ability of the online service to provide end-user experiences of sufficient quality, reliability, or both.
Data sets are created by associating groups of information generated by interactions with online services. A data set includes measurements (also, “points”), each of which describes an aspect of the information included in the data set. In some cases, anomalies (also, “outliers”) are present in the data set. One example of analysis that may be performed on data sets generated by online services is anomaly detection. An anomaly includes a point or group of points having a measurement of an unusual event or metric. An example of an anomaly is a point in a data set that has a statistically significant deviation from a majority distribution. Anomaly detection may be performed on a data set (e.g., a network log) to detect, for example, impact of a change in function for a given online service (e.g., errors after an update), responsiveness of end users to certain online content, indications of malware or other suspicious activity, or any other metric indicating a performance level associated with an online service.
A user often wishes to analyze a data set to determine whether anomalies are present in the data set. For example, a website manager reviews a data set describing website performance information to identify unusual increases or decreases in website performance. The website manager makes decisions regarding the website's operation based on the identified anomalies, such as whether to temporarily decommission the website for maintenance. In some cases, decisions based on the identified anomalies impact a large number of people (e.g., website users), or have a large financial impact (e.g., loss of online service). It is desirable that analysis of the data set accurately identify anomalies present in the data.
Current anomaly-detection techniques use a Generalized Extreme Studentized Deviate test algorithm (also, “GESD test” or “GESD”) to identify anomalies. However, the GESD test requires, as an input, an estimated number of anomalies, which serves as an upper bound on the number of points that the test evaluates as candidate anomalies. If the estimated number of anomalies is too low, actual anomalies in the data set are incorrectly identified as non-anomalous (i.e., false negatives). If the estimated number of anomalies is too high, actual non-anomalous points in the data set are incorrectly identified as anomalies (i.e., false positives). In addition, if the estimated number of anomalies is too high, additional iterations of the GESD algorithm are required, slowing the completion of the test and requiring additional computing resources. It is desirable to develop techniques to accurately and quickly identify anomalies. It is also desirable to develop techniques to correctly estimate the number of anomalies.
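For illustration, the dependence of the GESD test on the estimated number of anomalies can be sketched as follows. This is a minimal sketch of the standard GESD procedure, not an implementation taken from any particular system. The function names (gesd, t_pdf, t_cdf, t_ppf), the max_outliers and alpha parameters, and the sample data are assumptions introduced here for the example. To keep the sketch free of third-party dependencies, the Student's t quantile is approximated numerically (Simpson's rule plus bisection) rather than obtained from a statistics library. The sketch also assumes that max_outliers is smaller than the number of points minus two.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Approximate CDF for x >= 0 using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3


def t_ppf(p, df):
    """Approximate quantile for p > 0.5 by bisection on t_cdf."""
    lo, hi = 0.0, 1000.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def gesd(values, max_outliers, alpha=0.05):
    """Return indices flagged by the GESD test, most extreme first.

    max_outliers is the estimated number of anomalies that the caller
    must supply; the test runs exactly that many iterations.
    """
    n = len(values)
    remaining = list(range(n))
    removed, n_outliers = [], 0
    for i in range(1, max_outliers + 1):
        subset = [values[j] for j in remaining]
        m, s = mean(subset), stdev(subset)
        if s == 0:
            break  # all remaining points are identical
        # Test statistic: largest studentized deviation among remaining points.
        idx = max(remaining, key=lambda j: abs(values[j] - m))
        r_i = abs(values[idx] - m) / s
        # Critical value for iteration i.
        p = 1 - alpha / (2 * (n - i + 1))
        t = t_ppf(p, n - i - 1)
        lam = (n - i) * t / math.sqrt((n - i - 1 + t * t) * (n - i + 1))
        remaining.remove(idx)
        removed.append(idx)
        if r_i > lam:
            n_outliers = i  # the largest such i gives the anomaly count
    return removed[:n_outliers]


# Hypothetical daily metric with one obvious spike at index 12.
data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 9.7,
        10.1, 9.9, 10.2, 10.0, 25.0, 9.8, 10.1]
flagged = gesd(data, max_outliers=3)
print(flagged)  # index 12 (the value 25.0) is flagged
```

In this sketch, the loop always runs max_outliers iterations, recomputing the mean, standard deviation, and critical value each time. An estimate larger than needed therefore adds iterations even when no further anomalies exist, and an estimate smaller than the true count stops the test before all anomalies can be examined.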
Other existing anomaly-detection techniques use trained models to identify anomalies in data sets. The models are trained using historical data, then provided with the data set of interest for analysis. However, this approach is inadequate if a sufficient amount of historical data is not available, such as when the historical data includes relatively few data points (e.g., 10-20 points). In some cases, the model is trained using historical data that is estimated to be similar to the data set of interest. But this approach introduces uncertainty regarding the accuracy of the analysis results, since the training data might not be as similar as estimated. In addition, a trained model requires a seasonal pattern, such as a pattern that repeats over a time period (e.g., a weekly or daily time series). However, not all data sets are related to time, and not all data sets that are related to time have a seasonal pattern.
It is desirable to develop techniques to accurately identify anomalies in data sets (e.g., without false positives or false negatives). It is also desirable to develop techniques to correctly estimate the number of anomalies. It is also desirable to accurately identify anomalies in data sets for which sufficient historical data is not available, such as smaller data sets. It is also desirable to accurately identify anomalies in data sets that do not include a seasonal pattern.