The complexity of current computing systems and applications provided therein is quickly outgrowing the human ability to manage at an economic cost. For example, It is common to find data centers with thousands of host computing systems servicing hundreds to thousands of applications and components that provide web, computations and other services. In such distributed environments, diagnosis of failures and performance problems is an extremely difficult task for human operators. To facilitate diagnosis, commercial and open source management tools have been developed to measure and collect data from systems, networks and applications in the form of metrics and logs. However, with the large amounts of data collected, the operator is faced with the daunting task of manually going through the data, which is becoming unmanageable. These challenges have led researchers to propose the use of automated machine learning and statistical learning theory methods to aid with the detection, diagnosis and repair efforts of distributed systems and applications.
Current automated machine-learning approaches work well on small computing systems but do not scale to very large distributed environments with sizeable computing systems. There are two main difficulties of scaling up the learning approaches. First, with more components in a large distributed environment, there is a large increase in possible causes of failure and hence an increase in the data measurements, or metrics, that must be collected to pin-point them. This results in what is known as the “curse of dimensionality,” a phenomenon in which current diagnosis methods exhibit a reduction in accuracy for a fixed-size training set, as the number of metrics increases. With more data required and with the increase in the number of metrics, most current learning methods can also become too computationally expensive with longer learning curves. A second difficulty lies in combining different types of data, such as low level system metrics, application metrics, and semi-structured data (e.g., text based log files). Various types of data have different statistical characteristics (e.g., following different statistical distributions) which challenge efforts to combine such data with existing learning methods.