In enterprises with large IT infrastructures, monitoring of infrastructure elements (servers, applications, network elements etc.) is necessary to ensure that any infrastructure problem is detected as quickly as possible. Examples of monitored status data entities include the latency of a process, the availability of a server and the throughput of an application. “Normal” or desirable behaviour may be associated with the monitored data and deviation from the desirable behaviour could be arranged to trigger an “event” which is brought to the attention of an operator or otherwise used to bring the monitored entity to the “normal” state. As a large number of entities are usually monitored, an integrated console (also known as a dashboard) is often used to facilitate the monitoring of the current state of the system. Given the large number of monitored entities, it is desirable that the dashboard prioritizes information so that the information is easy to navigate.
There are many dashboards currently available on the market. Some are specific to the systems management environment, while others can be used as visual consoles in many different environments. Most dashboards provide features such as a customizable interface, alarm displays, a hierarchical view of performance metrics, the ability to drill down to details, and graphing/trending capability.
Most systems management dashboards systematically organize the sensed (measured) data and the state by displaying the data/state within an appropriate context. Examples of context are the metric name, the process name, the application name, the name of the server on which the application is running, the line of business under which the application is running etc. If the latency of process A, running on server B, were being measured, the context associated with the value could be the string <Metric latency><Process A><Server B>. Here Metric, Process and Server are context categories or meta-data, while latency, A and B are context values or instances.
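By way of illustration only, the context string described above could be built from, and split back into, an ordered list of (category, value) pairs. The following is a minimal sketch; the function names format_context and parse_context are hypothetical and are not part of any particular dashboard product.

```python
import re
from typing import List, Tuple

# A context is an ordered list of (context category, context value) pairs.
Context = List[Tuple[str, str]]


def format_context(context: Context) -> str:
    """Render a context as a string such as <Metric latency><Process A>."""
    return "".join(f"<{category} {value}>" for category, value in context)


def parse_context(text: str) -> Context:
    """Split a context string back into (category, value) pairs.

    Assumes the category contains no whitespace and the value contains no '>'.
    """
    return re.findall(r"<(\S+) ([^>]+)>", text)


context = [("Metric", "latency"), ("Process", "A"), ("Server", "B")]
label = format_context(context)
print(label)                 # <Metric latency><Process A><Server B>
print(parse_context(label))  # the original list of pairs
```

Keeping the pairs ordered matters because the order of the context categories later determines the nesting order of the hierarchy.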
It should, however, be appreciated that the above-discussed representation is only one possible representation of the contextual relationship between the various data levels. Each context value can be considered as a node in a graph, and depending upon the “topology” of the data centre, there could be many ways in which this graph could be connected or organized, e.g. a server may have many different applications running, each application may be accessing the same database etc. Here it is assumed that any of these connected graphs can be mapped to a hierarchical graph where each level of the hierarchy corresponds to a context category and nodes in the level correspond to context values associated with the category. An example of such a hierarchy is shown in the dashboard view snapshot of FIG. 1. The context categories or meta-data are Line of Business (LOB) 11, Service 12, Location 13, Application 14, Server 15, and Metric 16. The actual context values, which are not numbered for simplicity, are CreditCards for LOB, Billing for Service, and EMEA, AP and US for Location. The applications being monitored are PrintBill and Rating, while jupiter, neptune, ganga, etc. are servers. Finally, qsize/db, availability/process and utilization/cpu are instances of Metric. The mapping of an arbitrary connected graph to a hierarchical view can be done trivially by replicating the shared nodes, i.e., a node that has more than one parent is copied under each of its parents.
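The replication of shared nodes can be sketched as follows. This is an illustrative example only: it assumes the connected graph is acyclic and is given as a mapping from each node to its children, and the node name billing_db and the functions unfold and count_nodes are hypothetical.

```python
from typing import Dict, List, Tuple

# Adjacency mapping: node name -> names of the nodes directly beneath it.
Graph = Dict[str, List[str]]
# A tree node is (name, list of child tree nodes).
Tree = Tuple[str, list]


def unfold(graph: Graph, root: str) -> Tree:
    """Copy the graph into a tree, replicating every shared node.

    A node reachable through several parents is rebuilt once per parent.
    Assumes the graph has no cycles; a cycle would recurse without end.
    """
    return (root, [unfold(graph, child) for child in graph.get(root, [])])


def count_nodes(tree: Tree) -> int:
    """Count the nodes of an unfolded tree, replicas included."""
    _, children = tree
    return 1 + sum(count_nodes(child) for child in children)


# Two applications on one server share a (hypothetical) database node.
graph = {
    "jupiter": ["PrintBill", "Rating"],
    "PrintBill": ["billing_db"],
    "Rating": ["billing_db"],
}
tree = unfold(graph, "jupiter")
print(tree)
print(count_nodes(tree))  # 5 tree nodes for 4 distinct graph nodes
```

The cost of this simple mapping is that the tree can be larger than the graph, since each shared node appears once for every path that reaches it.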
Time series data, consisting of measurement samples, is associated with each level of the hierarchy. While the “leaf” data is always the measured data, i.e., data from the monitoring system, the data at a higher “node” in the tree could be measured data or aggregated data. Given the selection and ordering of the context categories in the above example, the metric data 16 is grouped by server 15, applications 14, locations 13, services 12 and LOB 11, in that nesting order. Thus the data at the node jupiter could be an aggregation of the metric 16 data corresponding to the measurements of availability/process, availability/connectivity, qsize/db, utilization/cpu and qsize/os. The aggregation functions could be supplied by the user, or they could be built-in functions that aggregate by simple operations such as union, intersection, addition etc.
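The aggregation of leaf data into higher nodes can be illustrated with a short sketch. It assumes, purely for the example, that the hierarchy is held as nested dictionaries whose leaves are lists of samples, that all series are sampled at the same instants, and that the aggregation function combines the samples taken at one instant; the function name aggregate and the sample values are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Union

Series = List[float]
Node = Union[Series, Dict[str, "Node"]]


def aggregate(node: Node, combine: Callable[[Sequence[float]], float]) -> Series:
    """Return the time series of a node.

    A leaf (a list) is measured data and is returned unchanged. An inner
    node (a dict) is aggregated sample by sample from its children's series.
    """
    if isinstance(node, list):
        return node
    child_series = [aggregate(child, combine) for child in node.values()]
    # zip(*...) lines up the samples that share the same time index.
    return [combine(samples) for samples in zip(*child_series)]


# Hypothetical queue-size samples under two servers.
servers = {
    "jupiter": {"qsize/db": [1.0, 2.0, 3.0], "qsize/os": [4.0, 5.0, 6.0]},
    "neptune": {"qsize/db": [10.0, 20.0, 30.0]},
}
print(aggregate(servers["jupiter"], sum))  # [5.0, 7.0, 9.0]
print(aggregate(servers, sum))             # [15.0, 27.0, 39.0]
print(aggregate(servers, max))             # [10.0, 20.0, 30.0]
```

Passing the aggregation function as a parameter mirrors the choice between user-supplied and built-in functions: addition is shown here, and max is substituted without changing the traversal.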
In large enterprises the number of entities that are being constantly measured is very large. There are many possible context categories or meta-data items, and each meta-data item has a large number of instances or values. Thus the default hierarchy, in which all meta-data are selected and ordered, would also be very large. Not all nodes in the hierarchy would be equally informative, i.e., the entropy of the data corresponding to the nodes would be different. Also, the meta-data selection and ordering could be changed to generate a different hierarchy with nodes that convey different information. Finding the right hierarchy and then traversing it to find which nodes to observe closely can be a very difficult and time-consuming navigational process. Accordingly, it is desirable that a dashboard offers the user a prioritised view, including navigation help beyond selection and drill up/down, in order to help the user to identify which particular time series should be observed more closely and in what context they should be observed.
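One possible way to compare how informative different nodes are is sketched below. It is an assumption made for illustration, not a method stated above: each node's time series is taken to consist of discrete states, its Shannon entropy is computed from the empirical frequency of those states, and nodes are ranked by that entropy. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def shannon_entropy(samples: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of samples."""
    total = len(samples)
    counts = Counter(samples)
    # Sum of p * log2(1/p) over the observed states; 0 for an empty series.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def rank_by_entropy(series_by_node: Dict[str, Sequence[Hashable]]) -> List[str]:
    """Order node names from highest to lowest entropy."""
    return sorted(
        series_by_node,
        key=lambda name: shannon_entropy(series_by_node[name]),
        reverse=True,
    )


# Hypothetical availability samples for three servers.
availability = {
    "jupiter": ["up", "up", "up", "up"],
    "neptune": ["up", "up", "down", "down"],
    "ganga": ["up", "up", "up", "down"],
}
print(shannon_entropy(availability["jupiter"]))  # 0.0
print(shannon_entropy(availability["neptune"]))  # 1.0
print(rank_by_entropy(availability))  # ['neptune', 'ganga', 'jupiter']
```

Under this reading, a node whose state never changes carries no information and falls to the bottom of the ranking, while a node that switches between states rises toward the top. Continuous measurements would first need to be binned into discrete states.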