Field of the Invention
Embodiments of the present invention generally relate to environmental management of data centers and, more particularly, to a method and apparatus for providing environmental management using smart alarms.
Description of the Related Art
Typical environmental alarms are based on two types of alarm events: 1) level-crossing events, such as a temperature rising above an alarm level, and 2) status alarms, such as a mismatch between a binary command (e.g., “ON”) and a status indicator providing binary feedback about whether or not the commanded unit has turned “ON”. There are three problems with these types of alarms. The first is that for some systems, such as the cooling and temperature management system in a data center, there may be hundreds or thousands of sensors or status indicators of the same logical type in the same managed space.
For example, in a typical 10,000 square foot data center, an environmental management system may have more than 200 sensor points, where most of them are rack inlet air temperature points measuring temperature at a particular cluster of sensors. Many sites with an environmental management system are up to 100,000 square feet with over a thousand sensor points and over one hundred cooling units, each with status indicator sensor points. Upon a cooling failure event at a large site, hundreds of level-crossing (high-temperature) alarms and dozens of status alarms occur, flooding notification systems such as email and text messaging systems and making it difficult for operators to determine the extent and scope of a problem. Even under normal temperature conditions, the large number of sensors and indicators significantly increases the chance of a false alarm caused by the failure of a single sensor or indicator.
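By way of illustration only, the two conventional alarm types and the resulting alarm flood may be sketched as follows. The class names, field names, sensor counts, readings, and the 90° F alarm level are hypothetical values chosen for the example and do not describe any particular system.

```python
from dataclasses import dataclass


@dataclass
class TemperatureSensor:
    name: str
    reading_f: float  # current reading in degrees Fahrenheit


@dataclass
class CoolingUnit:
    name: str
    commanded_on: bool  # binary command sent to the unit
    status_on: bool     # binary feedback reported by the unit


def level_crossing_alarms(sensors, alarm_level_f):
    """Return the names of sensors whose reading is above the alarm level."""
    return [s.name for s in sensors if s.reading_f > alarm_level_f]


def status_alarms(units):
    """Return the names of units whose status does not match the command."""
    return [u.name for u in units if u.commanded_on != u.status_on]


# Hypothetical snapshot during a cooling failure: every rack inlet point
# reads 95 F, and 12 of 20 commanded-ON units report that they are OFF.
sensors = [TemperatureSensor(f"rack-inlet-{i}", 95.0) for i in range(200)]
units = [CoolingUnit(f"unit-{i}", True, i >= 12) for i in range(20)]

high_temp = level_crossing_alarms(sensors, alarm_level_f=90.0)
mismatched = status_alarms(units)

# Each alarm would typically produce its own notification.
total_notifications = len(high_temp) + len(mismatched)
print(total_notifications)
```

In this hypothetical snapshot, a single underlying event produces one notification per sensor point and per mismatched unit, which is the flooding behavior described above.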
The second problem is that level-crossing alarms, such as high-temperature alarms, are a lagging indicator of a problem. For example, if a temperature sensor reads high, then the time that elapsed between the root-cause event of the high-temperature condition and the alarm notification has been lost. In some applications, such as a cooling failure in a high-density data center, this lost time may force a service interruption that could have been prevented if the alarm had been raised at the time of the root-cause event. Low level-crossing thresholds may be selected in order to preemptively compensate for this lost time, but low thresholds can result in false alarms.
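As a rough illustration of this lag, the sketch below estimates the time lost before a level-crossing alarm is raised. It assumes, purely for the example, that temperature rises at a constant rate after the root-cause event; the function name, the starting temperature, the alarm levels, and the rise rate are all hypothetical.

```python
def minutes_until_alarm(start_f, alarm_level_f, rise_f_per_min):
    """Minutes between the root-cause event and the level-crossing alarm,
    assuming a constant (linear) temperature rise after the event."""
    if rise_f_per_min <= 0:
        raise ValueError("rise rate must be positive")
    # If the reading already meets or exceeds the level, no time elapses.
    return max(0.0, (alarm_level_f - start_f) / rise_f_per_min)


# Hypothetical values: 75 F at the time of failure, rising 1.5 F per minute.
lag_high_threshold = minutes_until_alarm(75.0, 90.0, 1.5)
lag_low_threshold = minutes_until_alarm(75.0, 80.0, 1.5)

print(lag_high_threshold, lag_low_threshold)
```

Under these assumptions, lowering the alarm level shortens the lag, but it also places the threshold closer to normal operating temperatures, which is the source of the false alarms noted above.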
In some instances, the delay between the root-cause event and the level-crossing alarm notification can be avoided if an alarm is directly coupled to the root-cause event. For example, if the root-cause event is a cooling unit failure and if an alarm can be raised on the mismatch between the unit command and the unit status, then the status alarm can be a leading indicator of a high-temperature condition. However, this leads to the third problem, which is that only some status alarms indicate high-priority conditions. For data center cooling management, this is because there is normally redundant cooling, so that even if one unit or a small number of units fail, the temperature in the data center should remain under control. But sometimes a single unit failure can cause a severe problem due to lack of redundancy. For example, a partial failure of the cooling system may cause local temperatures in the data center to rise high enough to trip a fire suppression system, which may in turn shut down the remainder of the cooling units, causing temperatures in the data center to exceed 130° F.
Therefore, there is a need in the art for improved environmental management alarms.