As large networked systems such as digital networking systems (DNS) and clouds grow in scale and functionality, system complexity increases significantly, making it increasingly important to detect system anomalies or failures proactively and reliably. Early detection can greatly reduce the risk of service disruption, which is usually associated with large economic losses. To keep track of global system conditions, a monitoring system is usually deployed that records the running status of important local components, modules, or sub-systems.
To perform failure or anomaly detection, one method detects anomalies in noisy multivariate time series data by employing a sparse temporal event regression method to capture the dependence relationships among the variables in the time series. Anomalies are then found by performing a random walk traversal on the graph induced by the temporal event regression. Another method makes the fault detector available as a service to applications. Generally, such a system is composed of several failure detection agents running inside a distributed environment, each responsible for monitoring a subset of processes and for updating the applications.
Adaptive protocols can also be used for anomaly detection. These protocols adapt dynamically to their environment and, in particular, adapt their behavior to changing network conditions. Such adaptive approaches typically require domain knowledge of the distributed system as well as some interference with it (such as injecting test signals and observing the response in order to check whether the system is running normally).
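As an illustration of how a protocol might adapt to changing network conditions, the following minimal sketch assumes a heartbeat-style detector whose timeout is re-estimated from recently observed inter-arrival times. The heartbeat mechanism, the class name AdaptiveHeartbeatDetector, and the parameters window, safety_factor, and initial_timeout are all hypothetical choices made for this example; they are not taken from any specific protocol described above.

```python
import statistics
from collections import deque


class AdaptiveHeartbeatDetector:
    """Suspects a monitored process when no heartbeat arrives within an
    adaptive timeout (illustrative assumption: mean + k * std of the
    recent heartbeat inter-arrival times)."""

    def __init__(self, window=100, safety_factor=3.0, initial_timeout=1.0):
        # Sliding window of the most recent inter-arrival times.
        self.intervals = deque(maxlen=window)
        self.safety_factor = safety_factor
        self.initial_timeout = initial_timeout
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (in seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def timeout(self):
        """Current timeout; grows when arrivals become slower or more jittery."""
        if len(self.intervals) < 2:
            return self.initial_timeout
        mean = statistics.fmean(self.intervals)
        std = statistics.stdev(self.intervals)
        return mean + self.safety_factor * std

    def is_suspected(self, now):
        """True if the silence since the last heartbeat exceeds the timeout."""
        if self.last_arrival is None:
            return False
        return (now - self.last_arrival) > self.timeout()
```

With perfectly regular heartbeats the timeout in this sketch collapses to the mean interval, while jittery arrivals widen it, so a congested network produces fewer false suspicions at the cost of slower detection.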
A related family of algorithms performs causal inference, particularly based on the sparse Granger causality method. The typical approach in this family is to use vector auto-regressive (VAR) models to compute the relations among multiple time series. To make the connections sparse, an L1-norm regularization is added so that only a small subset of causal relations is identified as significant. In other methods, the causal structure is determined purely from statistical tests. These methods focus only on identifying temporal causal relations across the whole system; they do not address the further problem of identifying anomalies in the system.
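The L1-regularized VAR idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular prior method: the function names build_lagged, lasso_ista, and granger_graph, the lag order p, the penalty lam, and the threshold tol are all hypothetical, and the L1 problem is solved here with a plain proximal-gradient (ISTA) loop using only NumPy.

```python
import numpy as np


def build_lagged(X, p):
    """X has shape (T, d). Returns the lagged design matrix Z of shape
    (T - p, d * p) and the targets Y of shape (T - p, d)."""
    T, _ = X.shape
    # Column block k - 1 holds all d series delayed by k steps.
    Z = np.hstack([X[p - k:T - k] for k in range(1, p + 1)])
    Y = X[p:]
    return Z, Y


def lasso_ista(Z, Y, lam, n_iter=500):
    """Minimise (1 / 2n) * ||Y - Z W||_F^2 + lam * ||W||_1 by ISTA."""
    n = Z.shape[0]
    # Lipschitz constant of the gradient of the smooth term.
    lipschitz = np.linalg.norm(Z, 2) ** 2 / n
    W = np.zeros((Z.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = Z.T @ (Z @ W - Y) / n
        V = W - grad / lipschitz
        # Soft-thresholding: the proximal operator of the L1 norm.
        W = np.sign(V) * np.maximum(np.abs(V) - lam / lipschitz, 0.0)
    return W


def granger_graph(X, p=2, lam=0.1, tol=1e-6):
    """Boolean matrix G where G[i, j] is True if past values of series i
    receive a non-zero coefficient when predicting series j."""
    d = X.shape[1]
    Z, Y = build_lagged(X, p)
    W = lasso_ista(Z, Y, lam)
    # W[k * d + i, j] is the effect of series i at lag k + 1 on series j.
    strength = np.abs(W).reshape(p, d, d).sum(axis=0)
    return strength > tol
```

Calling granger_graph on a (T, d) array of observations returns a d-by-d adjacency matrix; increasing lam drives more coefficients exactly to zero and so yields a sparser graph. As the text notes, this step only recovers the temporal dependency structure and does not by itself flag anomalies.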