A very common problem addressed by electronic systems is the monitoring of a sensed condition, sometimes at a very large number of locations. For example, an entire city may be instrumented with a large number of sensors for sensing radiation or potentially toxic gases at a wide variety of locations. As another example, electronic devices may be monitored for various operating conditions. This latter example is relevant in computational clusters. Computational clusters have been developed in order to provide a large number of processors to host large computations, which include, but are not limited to, the simulation of complex events. These computational clusters can include tens of thousands of system components. These system components include computational nodes, switch ports, network cards, and storage elements. However, even with large numbers of computational nodes, the computations can require a considerable period of time to run. In a representative system, each computational node is housed in a rack-mounted chassis, and a large number of racks, each containing, for example, 30-50 chassis, are arranged in rows. An installation having 50,000 processors with 40 processors per rack would thus require 1,250 racks. Such an installation would also have 2 or more network switch ports per computational node and 10,000 storage elements (i.e., hard drive spindles). Insofar as the system components in a massively parallel processing system operate together, it is important to be able to determine when one or more of the system components is starting to malfunction, or becomes likely to malfunction, so that a failure does not occur during a complex and lengthy computation or, if a failure does occur, so that the failed component can be quickly identified and swapped out.
One technique that has been used to monitor and analyze electronic devices for impending malfunctions is to monitor the values (or additionally processed values) of the operating conditions of the devices and compare those values to predetermined threshold limits. If values (or processed values) occur that exceed these limits, then failure is believed to be imminent, and an alarm is given. For example, in the case of computational clusters, one could monitor operating conditions which include processor temperatures, fan speeds, and memory error rates. This is only a small set of the operating conditions that can be monitored and is not meant to be an exhaustive list. Sensors produce analog signals that are processed by an analog-to-digital converter in order to produce digital signals indicative of the values of the sensed conditions. These values can optionally be additionally processed by intermediate software layers. The raw or processed values are then received by one or more monitoring devices. The monitoring device or devices then compare the values to the predetermined threshold limits, which are believed to be indicative of impending failure. These limits are typically those determined by the design of the system components considered individually, and not with consideration of their placement in the system and the system's environmental conditions.
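The conventional fixed-threshold comparison described above can be sketched as follows. This is a minimal illustration only: the condition names, the numeric limits, and the helper names `Limit`, `check_reading`, and `scan` are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    low: float
    high: float


# Hypothetical design limits, set per component type without regard to
# where the component sits in the system or its environment.
DESIGN_LIMITS = {
    "cpu_temp_c": Limit(low=5.0, high=85.0),
    "fan_rpm": Limit(low=2000.0, high=12000.0),
    "mem_errors_per_hour": Limit(low=0.0, high=10.0),
}


def check_reading(condition, value, limits=DESIGN_LIMITS):
    """Return True if the value falls outside the fixed limits."""
    limit = limits[condition]
    return value < limit.low or value > limit.high


def scan(readings, limits=DESIGN_LIMITS):
    """Return the (node, condition, value) readings that raise an alarm."""
    return [
        (node, condition, value)
        for node, condition, value in readings
        if check_reading(condition, value, limits)
    ]
```

Every node is judged against the same limits in this sketch, regardless of its position in a rack or the ambient conditions around it, which is the limitation the paragraph above points out.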
It is, however, difficult to set the alarm limits in a manner which catches, in a timely manner, the cases that will result in failure without providing too many false positives. If the threshold is set too low, then one risks getting an excessive number of warnings to deal with, many of which are not reflective of an impending malfunction but are merely reflective of unlikely but still normal operating conditions of the monitored system. On the other hand, if the threshold is set high enough to avoid these false positive alarms, then the monitoring system may fail to detect a value indicative of an impending malfunction, or may detect it only when the failure is imminent. In this latter case, though the detection has occurred and can be used to preclude catastrophic failure by shutting down the device immediately, it is useless as a mechanism to drive graceful recovery. For these and other reasons, conventional systems for monitoring computational clusters typically fail to signal impending malfunctions with sufficient time margins to allow any response, before actual failure occurs, other than shutting down the malfunctioning system components.
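The trade-off in choosing a single threshold can be illustrated with a small sketch. The temperature values below are invented purely for illustration and are not measurements; the function name `alarm_counts` is likewise hypothetical.

```python
def alarm_counts(normal, failing, threshold):
    """Count false alarms and missed detections for one upper threshold.

    normal:  readings from devices that are operating normally
    failing: readings from devices that are about to malfunction
    """
    false_alarms = sum(1 for v in normal if v > threshold)
    missed = sum(1 for v in failing if v <= threshold)
    return false_alarms, missed


# Invented example readings (degrees C); the two populations overlap.
normal = [61, 63, 64, 66, 68, 70, 72, 75, 79, 83]
failing = [78, 82, 86, 90]

low_threshold = alarm_counts(normal, failing, 70)   # many false alarms
mid_threshold = alarm_counts(normal, failing, 80)   # some of each
high_threshold = alarm_counts(normal, failing, 85)  # missed detections
```

Because the normal and failing readings overlap in this invented data, no single threshold removes both kinds of error: lowering it trades missed detections for false alarms, and raising it does the reverse.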
Although the example discussed above is in the context of monitoring computational clusters, essentially the same problem can exist in any system that monitors a large number of sensors. For example, systems that monitor a large number of locations in a city for radiation, poison gases, nerve agents or other conditions may fail to detect an abnormal sensed condition in a timely manner because of the difficulty in setting optimal alarm limits.
There is, therefore, a need for a monitoring and analysis system that can optimally set alarm limits for a large number of sensors, where these alarm limits are reflective of a reasonable operating range of the monitored system as situated in its environment, as opposed to predefined variable value limits.
Performing a statistical analysis on the values from a statistically significant number of statistically similar monitored devices could address this problem. However, the problem is further complicated because the characteristics of the monitored devices are modified by the environment in which they are situated. The solution, then, is either to separate the probabilistic model(s) of the monitored devices from the effects of the environment or to include said effects in said model(s), and then to perform said statistical analysis using the results.
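One possible way to picture such an analysis is sketched below. It is an assumption made for illustration, not the method of this disclosure: devices are grouped by a hypothetical environment label (for example, position within a rack), and alarm limits for each group are derived from the median and the median absolute deviation (MAD) of that group's own readings. The function names, the grouping key, and the multiplier `k` are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def robust_limits(values, k=3.0):
    """Derive (low, high) limits from a population of similar devices.

    Uses median +/- k * 1.4826 * MAD. If the MAD is zero the limits
    collapse onto the median, so a real system would need a floor.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    spread = 1.4826 * mad  # scales the MAD to a standard deviation for normal data
    return med - k * spread, med + k * spread


def flag_outliers(readings, k=3.0):
    """readings: list of (device_id, env_group, value) tuples.

    Returns the ids of devices whose value lies outside the limits
    computed for their own environment group.
    """
    groups = defaultdict(list)
    for _, env, value in readings:
        groups[env].append(value)
    limits = {env: robust_limits(vals, k) for env, vals in groups.items()}
    return [
        dev
        for dev, env, value in readings
        if not (limits[env][0] <= value <= limits[env][1])
    ]


# Invented temperatures: devices at the top of a rack run warmer.
readings = [
    ("t1", "top", 70), ("t2", "top", 71), ("t3", "top", 72),
    ("t4", "top", 73), ("t5", "top", 74), ("t6", "top", 90),
    ("b1", "bottom", 60), ("b2", "bottom", 61), ("b3", "bottom", 62),
    ("b4", "bottom", 63), ("b5", "bottom", 64), ("b6", "bottom", 72),
]
flagged = flag_outliers(readings)
```

In this invented data, device `b6` reads 72, which would be unremarkable at the top of a rack but is abnormal among its neighbors at the bottom. A single fixed limit of, say, 85 would miss it, whereas limits derived per environment group flag it along with `t6`.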