Web services such as Facebook and Amazon appear to users as a unified service. Behind the user's view, these types of services are built on complex systems of components such as routers, switches, servers and server clusters, and databases, to name a few. These components combine to present a single front-end and the appearance of a unified service. These types of services are distributed systems and may be built by integrating distributed subsystems. Each subsystem may be specialized to a small number of functions. One subsystem might manage user authentication, while another handles file search, and yet another may handle data storage. These distributed systems, which may be composed of hundreds or thousands of machines, are deployed within or across large data centers. Such distributed systems are often referred to as cloud services.
Users expect cloud services to be available and responsive. However, within a data center, even under normal operating conditions, given the scale and complexity of the hardware and software of a cloud service, at any one time many hardware or software components may be in various degraded states such as failing, undergoing upgrade, or failed. Typically, cloud services are built with duplication and resilience to minimize the impact of these problems on performance or availability. Nonetheless, components of a cloud service can cause the cloud service to fail. For example, failure of a component involved in an unanticipated dependency can lead to a significant service outage, as can the failure of too many critical components at once.
Operators who monitor and manage cloud services, and developers who build tools for operators, may have goals such as proactively identifying problems before or as they occur, localizing and diagnosing problems that arise in the field, and ensuring that unanticipated failures are not triggered during a service upgrade (during which time the system is particularly vulnerable). However, current tools for operators of these systems are inflexible, and in general do not allow visualization at varying scales, including visualization of very large scale services/systems and visualization through varying levels of size and organization down to individual machine and software components. Current tools make use of elements such as lengthy lists or tree-views of individual components, which are impractical for visualizing cloud services that may involve thousands of components. Generally, tools that can visualize individual component machines/servers cannot visualize how such components are organized or how they are functioning as a unit. Tools that can provide a high level view of a service do not provide views into individual machines/servers. Furthermore, such tools are incapable of reflecting the many levels of organization and the varying relationships between organizational units. Even tools that allow navigation of a hierarchy do not aggregate data in a way that reflects a system's organization. For example, there are no visualization tools that aggregate, in a flexible way, information about clusters of machines or information about groups of clusters.
Not only are current tools inflexible, but they also fail to take advantage of information that may be available. A wide variety of configuration and usage data may be available for viewing behind each component of a cloud service. As new features are regularly added to a service, corresponding new sets of logging features grow more numerous and become unmanageable. In sum, operators are not lacking in data about their cloud services. However, they are lacking tools for gaining rapid insight from the mass of available data.
Operators of cloud services aim to identify anomalies and problems, but a high degree of replication and a high degree of natural variability in the workloads of components can make this difficult. It may be that one server in a cluster is running slowly: perhaps its disk is failing, and disk seeks are being retried. A set of databases may be overloaded due to specific content becoming popular. Workload aberrations may cause sharp increases in the computational loads within a cluster. Response times may increase because of the increased complexity of answering individual requests. The types of systemic problems are limitless. Operators lack tools for identifying anomalies and problems across distributed systems, and in particular for correlating events and trends across the highly replicated structure of these services, where variations over time are more informative than baseline averages.
Mathematical and statistical approaches have been used to address this correlation and anomaly detection problem. Such approaches, while useful, do not take advantage of the human mind's ability to rapidly synthesize visual information. Techniques described below relate to allowing developers of cloud services to easily build effective, customized visualizations of cloud service configuration, behavior, and health.