1. Field of the Description
The present description relates, in general, to monitoring of components of a computer system such as field replaceable units (FRUs) and the like, and, more particularly, to a method and system of tracking fault status of sensors such as those provided on FRUs and other portions of a computer system to address problems associated with the removal of a faulty FRU and its replacement (e.g., with the same or a differing faulty FRU) and other operating challenges (e.g., loss of power) that can result in improper identification of operating states within a computer system.
2. Relevant Background
Significant progress has been made in recent years in facilitating accurate monitoring of the operation of computer systems. For example, management of servers and systems with servers has been standardized through the use of the Intelligent Platform Management Interface (IPMI) and other tools. IPMI facilitates interconnection of a central processing unit (CPU) and devices being managed by providing replication of monitoring functions from server to server and supporting large numbers of monitoring devices. IPMI is an industry-standard hardware management interface specification that provides an architecture defining how unique devices can communicate with the system controller or service processor (sometimes referred to as a Baseboard Management Controller (BMC)), which runs embedded software and controlling firmware.
One example of the management functions facilitated by the IPMI is monitoring server operations such as the operating status of field replaceable units (FRUs). FRUs are circuit boards, parts, or assemblies of a computer system that may be replaced, often while the computer system is running (e.g., via hot swapping). To this end, the IPMI software may receive information from the FRU and/or from sensors on each FRU, which may allow it to determine FRU presence (e.g., whether a processor blade is plugged into a server) and the present operating state of the FRU (e.g., operating within predefined ranges (or “OK”) or faulty based on sensor readings). Typically, according to the IPMI standard, the IPMI software or module may act to transmit an event to a fault manager or fault diagnosing module upon a state change of a sensor (such as a change from OK to faulty).
While generally providing adequate state information to determine faults in FRUs and other monitored components, the existing standardized monitoring techniques or guidelines do not address all operational situations, and this can result in an inaccurate state being reported to a system controller or other management devices. For example, some sensors can become unreadable, such as a sensor located on an FRU that is removed or a sensor that is not readable when one or more power domains are turned off.
With regard to the FRU example, a sensor may transition to faulty, and the IPMI software processes this transition information from the sensor. As a result, an event is sent to the fault manager or fault diagnosing module of the system controller, which, in turn, diagnoses the FRU as faulty. When the faulty FRU is removed from the computer system, the sensor becomes unreadable. Based on FRU presence information, the IPMI software notifies the fault manager that the FRU was removed from the computer system, and the fault manager deletes the fault as having been serviced or corrected. A problem can now arise, though, if the FRU is re-inserted while still faulty (or a new FRU that is faulty is inserted). In that case, the IPMI module (or its sensor logic) will not detect any transition because, according to the IPMI specification, being unreadable is treated as a temporary status and not as a state change. The last recorded state of the sensor was faulty, and the first readable state after insertion is also faulty. As a result, no event is sent by the IPMI module, and the fault manager will not report to a system controller or other management devices that the FRU is faulty.
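The sequence described above may be illustrated with a minimal sketch in Python. The sketch is not an implementation of IPMI; the names Reading, SensorMonitor, FaultManager, simulate, and the FRU label "blade0" are hypothetical and are introduced only to model the described behavior, under the assumption that sensor logic emits an event only when a readable state differs from the last recorded state.

```python
from enum import Enum


class Reading(Enum):
    OK = "ok"
    FAULTY = "faulty"
    UNREADABLE = "unreadable"


class SensorMonitor:
    """Hypothetical sensor logic: reports an event only on a state change."""

    def __init__(self):
        self.last_state = Reading.OK

    def poll(self, reading):
        # Assumption: "unreadable" is a temporary status, not a state,
        # so the last recorded state is kept and no event is produced.
        if reading is Reading.UNREADABLE:
            return None
        if reading is not self.last_state:
            event = (self.last_state, reading)
            self.last_state = reading
            return event
        return None


class FaultManager:
    """Hypothetical fault manager tracking which FRUs are diagnosed faulty."""

    def __init__(self):
        self.faults = set()

    def on_sensor_event(self, fru, old, new):
        if new is Reading.FAULTY:
            self.faults.add(fru)
        elif new is Reading.OK:
            self.faults.discard(fru)

    def on_fru_removed(self, fru):
        # Removal is taken to mean the fault was serviced or corrected.
        self.faults.discard(fru)


def simulate():
    monitor = SensorMonitor()
    manager = FaultManager()
    fru = "blade0"  # hypothetical FRU label

    def poll(reading):
        event = monitor.poll(reading)
        if event is not None:
            manager.on_sensor_event(fru, *event)

    poll(Reading.OK)
    poll(Reading.FAULTY)            # OK -> faulty: event sent, fault diagnosed
    before_removal = set(manager.faults)

    manager.on_fru_removed(fru)     # faulty FRU removed: fault deleted
    poll(Reading.UNREADABLE)        # sensor unreadable: no state change

    poll(Reading.FAULTY)            # faulty FRU inserted: faulty -> faulty, no event
    after_reinsert = set(manager.faults)
    return before_removal, after_reinsert
```

In this sketch, simulate() returns the fault set before removal, which contains the FRU, and the fault set after the faulty FRU is inserted again, which is empty. The empty set corresponds to the unreported fault described above.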
Hence, there remains a need for methods and systems for providing more accurate tracking of fault status of sensors such as those provided on or with FRUs. Preferably, such methods and systems would at least provide a solution to the example provided above regarding insertion of a faulty FRU or an FRU with a latent fault.