1. Field
The present embodiments relate to techniques for monitoring and analyzing computer systems. More specifically, the present embodiments relate to a method and system for detecting and managing power supply unit degradation in a computer system by analyzing telemetry data from the computer system.
2. Related Art
As electronic commerce becomes more prevalent, businesses are increasingly relying on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is important to ensure high availability in such enterprise computing systems.
To achieve high availability, it is necessary to capture unambiguous diagnostic information that can be used to quickly locate faults in hardware or software. If a system performs too little event monitoring, service engineers may be unable to quickly identify the source of a problem that arises at a customer site, which in turn may lead to increased downtime.
In particular, power supply units (PSUs) for high-end computer servers are typically manufactured by power supply vendors instead of by server manufacturers. Such commodity PSUs may lack internal diagnostics and/or sensors that report fan failures caused by gradual degradation in bearings, lubrication, mechanical parts, and/or fan motors in the PSUs. Because fan failures in commodity PSUs may go unnoticed, the PSUs may continue operating without working fans until temperature increases in the PSUs and/or server components result in server shutdowns and/or other failures. For example, a fan failure in a PSU within a server may cause both the PSU and a set of processors in the server to heat up until the server is shut down by a thermal trip.
Furthermore, techniques for locating and replacing degraded PSUs in servers frequently involve manual investigation and/or unnecessary costs. For example, a technician may find a degraded PSU in a data center by holding a tissue next to the air vents of the PSUs and identifying the air vent that does not produce airflow. Similarly, a worldwide recall of PSUs may require that all PSUs for a particular platform be replaced, even if only a fraction of the PSUs is expected to fail.
Hence, what is needed is a mechanism for detecting degraded PSUs before the degradation results in failures.