A computer cluster may include a set of resources. The resources may include, for example, one or more processors or central processing units (CPUs), memory caches, interconnects, disks, “abstract resources” (e.g., resources modeled in an operating system), and so on. These resources may be connected together and may have a complex mesh of dependencies that may change over time. In one example, a cluster may be thought of as a group of independent entities (e.g., servers, resources) that cooperate as a single logical system. This type of cluster may be used to provide load balancing, high availability, and so on. An example cluster may include multiple computers, multiple storage devices, and redundant interconnections that together form what appears to be a single highly available system. Different resources may independently and asynchronously generate events, take actions, experience loads, consume other resources, experience failures, and so on. Thus, different resources may be monitored and measured using a complex set of metrics that describe performance, availability, and so on. These metrics may be direct measurements including, for example, CPU usage, bandwidth availability, bandwidth consumed, memory usage, cache usage, hard drive usage, and so on. The metrics may also be more complicated measurements concerning, for example, the size of a queue (e.g., an input/output queue), whether the queue is growing, and so on.
Things that are measured tend to be things that are monitored. Thus, systems have been developed to report on the metrics available for cluster members. Conventional cluster monitoring tools have been single-node oriented. However, at least one conventional measurement tool provides a static display of selected sets of metrics for multiple cluster nodes. This conventional tool performs distributed monitoring of clusters and uses a multicast-based listen/announce protocol to monitor state within clusters. The conventional tool may logically federate clusters and aggregate their state based on a tree of point-to-point connections among representative cluster nodes. In this conventional approach, a node monitors its own local resources and distributes multicast packets with information concerning those monitored local resources. The multicast packets may be sent to a well-known multicast address periodically and/or upon the occurrence of a pre-determined, configurable event.
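The announce and listen roles described above can be sketched roughly as follows. This is a minimal illustration and not the conventional tool's actual implementation: the multicast group and port, the JSON packet layout, and every function name (build_announcement, parse_announcement, open_sender, open_listener, run_announcer) are assumptions chosen for the example.

```python
import json
import socket
import struct
import time

# Hypothetical "well-known" group and port; the text does not specify them.
MCAST_GROUP = "239.255.0.1"
MCAST_PORT = 50000


def build_announcement(node_id, metrics, now=None):
    """Encode one node's locally monitored metrics as a UDP payload (assumed JSON layout)."""
    payload = {
        "node": node_id,
        "ts": time.time() if now is None else now,
        "metrics": metrics,
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def parse_announcement(data):
    """Decode a payload produced by build_announcement."""
    return json.loads(data.decode("utf-8"))


def open_sender(ttl=1):
    """Create a UDP socket for sending multicast; TTL 1 keeps packets on the local subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    return sock


def open_listener():
    """Create a UDP socket that has joined the multicast group on any interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    # ip_mreq structure: group address followed by local interface address.
    mreq = socket.inet_aton(MCAST_GROUP) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def run_announcer(node_id, read_metrics, interval_s=10.0, rounds=3):
    """Periodically announce the metrics returned by the caller-supplied read_metrics()."""
    sock = open_sender()
    try:
        for _ in range(rounds):
            packet = build_announcement(node_id, read_metrics())
            sock.sendto(packet, (MCAST_GROUP, MCAST_PORT))
            time.sleep(interval_s)
    finally:
        sock.close()
```

A listener would call open_listener(), read datagrams with recvfrom, and pass each one to parse_announcement. An event-triggered announcement, as opposed to a periodic one, could call build_announcement and sendto directly when the configured event occurs.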
However, as noted above, conventional tools have provided static displays of selected sets of metrics for cluster nodes. These static displays may be overwhelmed by a system having a large set of nodes and may in turn overwhelm a user trying to understand the provided information or to predict possible deterioration of function. As a result, the data provided by conventional tools may only be analyzed in a cluster post-mortem. The post-mortem may identify trends and tendencies that may have been difficult to identify while the cluster was still functioning. The difficulty may have arisen due to issues with representation. For example, display space may have quickly been overwhelmed by data associated with a large set of cluster elements (e.g., 100 nodes, 1,000 processes per node, 5,000 disks and interconnections) that provide a large set of measurements (e.g., 10 measurements per element). Also, it may have been difficult to decide what to display and how to group related things, if relationships could even be identified. Conventionally, data may have been displayed using histograms, which may be inappropriate for a data set of the size (e.g., 10^6 monitored entities) and character described above.
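The example figures above can be tallied to show the scale involved. The following is only one possible accounting: it assumes that nodes, processes, and disks/interconnections each count as one element and that every element reports the same number of measurements. The variable names are illustrative.

```python
# Example figures taken from the text above.
nodes = 100
processes_per_node = 1_000
disks_and_interconnections = 5_000
measurements_per_element = 10

# Assumption: each node, process, disk, and interconnection is one monitored element.
elements = nodes + nodes * processes_per_node + disks_and_interconnections

# Assumption: every element reports the same number of measurements.
data_points = elements * measurements_per_element

print(f"{elements:,} elements")        # 105,100 elements
print(f"{data_points:,} data points")  # 1,051,000 data points
```

Under these assumptions the cluster yields roughly a million values per sampling interval, which is far more than a static per-node display or a set of histograms can present legibly.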