Many organizations have begun consolidating servers into centralized data centers, looking to use physical, application or data consolidation as a means of reducing the challenges and costs associated with administering many small servers scattered across the enterprise. To date, physical consolidation has generally involved replacing bulky tower servers with slender 1U or 2U rack systems. The slender rack systems take less space and put the servers and infrastructure within easy reach of the administrator, rather than spread across a large campus. These systems enable organizations to reap many benefits of consolidation; however, each server requires its own infrastructure, including cables for power, Ethernet and systems management, as well as power distribution units (PDUs), keyboard/video/mouse (KVM) switches and Fibre Channel switches, which present challenges of their own. For instance, a rack of 42 1U servers can have hundreds of cables strung throughout the rack, making it difficult to determine which cables attach where and complicating the installation and removal of servers from the rack. Further, the PDUs and switches consume valuable rack sidewall space.
This trend has led to even more compact server arrangements such as blade servers. A blade server, e.g., IBM eServer BladeCenter, is a type of rack-optimized server that eliminates many of these complications, thus providing an effective alternative to 1U and 2U servers. Blade server designs range from ultra-dense, low-voltage, lower-performance servers to high-performance, lower-density servers to proprietary, customized rack solutions that include some blade features.
The term “blade server” refers to an enclosure that can hold a number of hot-swappable devices called blades. Blades come in two varieties: server blades and option blades. A server blade is an independent server, containing one or more processors with associated memory, disk storage and network controllers, which runs its own operating system and applications. Each server blade within a system enclosure slides into a blade bay and plugs into a mid-plane or backplane to share common infrastructure elements, such as power supplies, fans, CD-ROM and floppy drives, Ethernet and Fibre Channel switches and system ports. Option blades, which may be shareable by the server blades, provide additional features, such as controllers for external I/O or disk arrays and additional power supplies.
The compactness of systems like blade servers forces otherwise independent servers to share a thermal profile with hardware resources, including enclosures, power supplies, fans, and management hardware, making power consumption and cooling much more critical. Because of the large number of elements housed within blade servers, the airflow and heating patterns within them are fairly complicated, and many possible sources of thermal problems can exist. This increases both the detrimental effects of, and the maintenance associated with, thermal crises and the thermal problems that can lead to them.
A thermal crisis is a situation in which a monitored temperature within a thermal system, such as an enclosure of an electronic system or an element in the electronic system, has reached a critical threshold beyond which the equipment should not be operated. A thermal crisis may result from a number of different situations, including: overheating of an electrical component within the enclosure; partial or total failure of a cooling fan within the enclosure; failure of the room heating, ventilation, and air conditioning (HVAC) equipment; inadequate cooling in the facility (e.g., overcrowded machine room); airflow blocked by an outside obstruction (e.g., close proximity to a wall or materials draped in front of air intakes); removal (without replacement) of an enclosure panel or subsystem (resulting in redirected airflow within the enclosure); and over-configuration (i.e., violation of enclosure specifications regarding limitations for some configurations).
To prevent damage to the electronic system, it is common practice in the industry to provide internal temperature monitoring and to implement a dual-level threshold scheme to recognize a thermal problem before it develops into a thermal crisis. When a temperature reading crosses a first (warning) threshold (e.g., 55° C.), a warning message is generated to notify the operator. The electronic system will often shut down if any sensor reading crosses the second (error) threshold (e.g., 65° C.). In the present state of the art, a major component near the sensor producing an unacceptable reading is assumed to be responsible for the situation, and no effort is made to isolate the problem to one of the root causes listed above. It would be desirable to implement a mechanism by which the sensor data is analyzed in the event of an over-temperature condition to optimize the response taken.
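The conventional dual-level threshold scheme described above can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not an implementation taken from any particular system: the names `ThermalState`, `classify_reading` and `evaluate_sensors` are hypothetical, the 55° C. and 65° C. values are the example thresholds from the text, and treating a reading equal to a threshold as having crossed it is an assumption.

```python
from enum import Enum

# Example thresholds from the text; real values are system-specific.
WARNING_THRESHOLD_C = 55.0
ERROR_THRESHOLD_C = 65.0


class ThermalState(Enum):
    """Hypothetical states for the dual-level scheme."""
    NORMAL = "normal"
    WARNING = "warning"  # first threshold crossed: notify the operator
    ERROR = "error"      # second threshold crossed: system often shuts down


def classify_reading(temp_c, warning_c=WARNING_THRESHOLD_C,
                     error_c=ERROR_THRESHOLD_C):
    """Map a single sensor reading (degrees C) to a thermal state.

    Assumption: a reading equal to a threshold counts as crossing it.
    """
    if temp_c >= error_c:
        return ThermalState.ERROR
    if temp_c >= warning_c:
        return ThermalState.WARNING
    return ThermalState.NORMAL


def evaluate_sensors(readings):
    """Return (overall state, sorted names of sensors not in NORMAL state).

    `readings` maps a sensor name to its temperature in degrees C.
    The overall state is the most severe state of any single sensor,
    mirroring the "any sensor reading" rule for the error threshold.
    """
    states = {name: classify_reading(t) for name, t in readings.items()}
    if any(s is ThermalState.ERROR for s in states.values()):
        overall = ThermalState.ERROR
    elif any(s is ThermalState.WARNING for s in states.values()):
        overall = ThermalState.WARNING
    else:
        overall = ThermalState.NORMAL
    flagged = sorted(n for n, s in states.items()
                     if s is not ThermalState.NORMAL)
    return overall, flagged
```

For example, `evaluate_sensors({"blade1": 48.0, "blade2": 57.5, "fan_zone": 66.0})` returns `(ThermalState.ERROR, ["blade2", "fan_zone"])`. Note that the sketch only reports which sensors exceeded a threshold; like the scheme it illustrates, it makes no attempt to attribute the readings to a root cause.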