The present invention relates to techniques for monitoring computer systems.
As electronic commerce becomes more prevalent, businesses are increasingly relying on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is critically important to ensure high availability in such enterprise computing systems.
To achieve high availability in enterprise computing systems, it is necessary to be able to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. Some high-end servers, which cost over a million dollars each, contain hundreds (or even thousands) of physical sensors that measure temperatures, voltages and currents throughout the system.
These physical sensors are typically monitored through either polling or interrupts. In systems that monitor sensors through polling, the value of a physical parameter monitored by the sensor is queried (polled) by software at some preset (typically adjustable) sampling interval. Note that polling creates time series values for each polled sensor.
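The polling process described above can be illustrated with a minimal sketch. The function name `poll_sensors`, the `read_fns` mapping of sensor names to read callbacks, and the simulated timestamps are assumptions made for illustration only; they do not describe any particular service-bus interface.

```python
from typing import Callable, Dict, List, Tuple


def poll_sensors(
    read_fns: Dict[str, Callable[[], float]],  # hypothetical: sensor name -> read callback
    num_samples: int,
    sampling_interval: float,
) -> Dict[str, List[Tuple[float, float]]]:
    """Poll every sensor once per sampling interval and collect (time, value) pairs."""
    series: Dict[str, List[Tuple[float, float]]] = {name: [] for name in read_fns}
    for k in range(num_samples):
        # Simulated timestamp; a real poller would wait for the next interval here.
        t = k * sampling_interval
        for name, read in read_fns.items():
            series[name].append((t, read()))
    return series
```

Each polled sensor thus yields its own time series of sampled values, one entry per sampling interval.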
In systems that use interrupts to monitor sensors, a value for a monitored parameter is reported by the sensor only if the value crosses an upper or lower threshold value (also called a latch limit, or latch threshold).
An advantage of interrupt-driven sensors is that there is continuous assurance that the physical parameter being monitored by the sensor is within its specified operational bounds (as long as the sensor is still working). Note that in polled sensors, there are gaps between observations, and it is possible for the physical variable being monitored to stray outside of the operational bounds during these gaps. Although one could reduce the size of these gaps by sampling at a higher frequency, doing so consumes additional bandwidth on the service bus. This can cause bandwidth problems on the service bus if hundreds or thousands of sensors are being polled through a sequential polling process. Note that if many sensors are being polled, and each polling operation takes a fixed amount of time, there is an upper limit to the number of sensors that can be polled at a given sampling rate.
FIG. 1 illustrates a measured parameter 102 versus time 104 for an interrupt-driven monitoring system. An interrupt-driven monitoring system typically has an upper limit 106 and/or a lower limit 108 for the measured parameter. As illustrated in FIG. 1, measured parameter 102 varies with time and is normally between upper limit 106 and lower limit 108.
If measured parameter 102 falls below lower limit 108, the physical sensor measuring measured parameter 102 generates interrupt 110. Likewise, if measured parameter 102 exceeds upper limit 106, the physical sensor generates interrupt 112. Note that interrupts 110 and 112 can be generated only once, as the value of measured parameter 102 crosses lower limit 108 and upper limit 106, respectively. Alternatively, interrupts 110 and 112 can be generated continuously while measured parameter 102 is out-of-bounds, or they can be generated only at the transition points where measured parameter 102 goes out-of-bounds and returns in-bounds.
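The three interrupt-generation behaviors described above can be compared with a short simulation over sampled values. This is a sketch only: the function name `generate_interrupts`, the mode names "crossing", "continuous" and "transitions", and the event labels are hypothetical, and a physical sensor would compare against its latch limits in hardware rather than over a list of samples.

```python
from typing import List, Sequence, Tuple


def generate_interrupts(
    samples: Sequence[float], lower: float, upper: float, mode: str = "crossing"
) -> List[Tuple[int, str]]:
    """Return (sample index, event) pairs for the chosen interrupt mode.

    mode="crossing":    one interrupt each time the value goes out-of-bounds
    mode="continuous":  an interrupt for every out-of-bounds sample
    mode="transitions": interrupts when going out-of-bounds and when returning in-bounds
    """
    if mode not in ("crossing", "continuous", "transitions"):
        raise ValueError(f"unknown mode: {mode}")
    events: List[Tuple[int, str]] = []
    prev_state = "in"  # assume the parameter starts within its limits
    for i, value in enumerate(samples):
        state = "low" if value < lower else "high" if value > upper else "in"
        if mode == "continuous":
            if state != "in":
                events.append((i, state))
        elif state != prev_state:
            # Report a new excursion in both remaining modes; report the
            # return to in-bounds only in "transitions" mode.
            if state != "in" or mode == "transitions":
                events.append((i, state))
        prev_state = state
    return events
```

For instance, with limits of 0 and 10, the samples 5, 11, 12, 5 produce a single "high" event in "crossing" mode, two in "continuous" mode, and a "high" event followed by an "in" event in "transitions" mode.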
FIG. 2 illustrates a measured parameter 102 versus time 104 in the case of a sensor failure in an interrupt-driven monitoring system. In this example, when the physical sensor fails at 202, the value reported by the physical sensor is stuck between upper limit 106 and lower limit 108. In this situation, no interrupts are generated and the failed sensor is not reported. Furthermore, if the measured parameter 102 actually passes upper limit 106 or lower limit 108 after this sensor failure, no interrupts are generated.
FIG. 3 illustrates a measured parameter 302 versus time 304 in a polled system. Note that in the polled system, no upper limit or lower limit is monitored by the physical sensor. Instead, the value of measured parameter 302 is read periodically (at polling points 306) and these parameter values are forwarded to an analysis system to determine if measured parameter 302 is out-of-bounds. Note that since polling is a sequential process, the polling frequency may be limited by the number of sensors within the system. For example, suppose a system includes a thousand sensors and each polling operation for a given sensor takes 3.5 milliseconds. If this system polls all of the thousand sensors sequentially, the time interval between consecutive polling operations for a given sensor is at least 3.5 seconds. Consequently, it is possible that an out-of-bounds parameter will not be recognized for up to 3.5 seconds. This 3.5 second delay in taking evasive action can potentially lead to a catastrophic failure, which might have been averted if the out-of-bounds signal had been detected sooner.
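The arithmetic in the example above can be written out as a brief sketch. The function names `polling_cycle_seconds` and `max_sensors` are hypothetical, and the sketch assumes strictly sequential polling with a fixed time per polling operation and no other overhead.

```python
def polling_cycle_seconds(num_sensors: int, poll_time_ms: float) -> float:
    """Minimum time between consecutive polls of the same sensor, in seconds."""
    return num_sensors * poll_time_ms / 1000.0


def max_sensors(poll_time_ms: float, target_interval_s: float) -> int:
    """Largest number of sensors that can be polled within the target interval."""
    return int(target_interval_s * 1000.0 // poll_time_ms)


# A thousand sensors at 3.5 milliseconds per poll give a 3.5 second cycle.
cycle = polling_cycle_seconds(1000, 3.5)
```

Under the same assumptions, requiring each sensor to be polled at least once per second would limit the system to 285 sensors at 3.5 milliseconds per polling operation.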
An advantage of polling over interrupt-driven sensors is that there is a wealth of diagnostic/prognostic information contained in the values gathered during the polling process, even when the values are safely between their threshold limits. For example, using values obtained during the polling process, it is possible to infer correlations between signals. By monitoring these correlations, it is possible to detect system anomalies even when measured parameters are not out-of-bounds. This can provide an earlier and more sensitive indication of a possible incipient problem. In addition, if a sensor fails in such a manner that it keeps its last mean value, but is no longer responding to the variable it is monitoring, applying simple pattern recognition algorithms to the polled responses can easily catch this failure.
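One simple way such a stuck sensor might be caught from polled values is to flag a trailing window whose variability collapses to essentially zero. The following is a sketch of that one approach only; the function name `first_stuck_index`, the window length and the `min_std` threshold are assumptions chosen for illustration, and other pattern recognition algorithms could be used instead.

```python
from statistics import pstdev
from typing import Optional, Sequence


def first_stuck_index(
    values: Sequence[float], window: int = 5, min_std: float = 1e-6
) -> Optional[int]:
    """Return the index of the first sample whose trailing window is flat, or None.

    A window is considered flat when its population standard deviation falls
    below min_std, suggesting that the sensor no longer responds to the
    variable it is monitoring.
    """
    for end in range(window, len(values) + 1):
        if pstdev(values[end - window:end]) < min_std:
            return end - 1
    return None
```

Note that a stuck value of this kind stays between the threshold limits, so it would generate no interrupt in an interrupt-driven system, whereas the polled time series exposes the loss of normal signal variation.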