With the commoditization of servers and other network components in the data center, advanced reliability, availability and supportability (RAS) is increasingly becoming a differentiating factor for hardware manufacturers. Automated hardware/firmware analysis is also becoming more challenging, particularly when it comes to isolating problems that span multiple subsystems. As computer technology advances, for example through continuous changes to fabrication technologies, analysis of hardware/firmware errors based on past experience and data may not always be appropriate for newer hardware/firmware products.
The lack of historical failure data for new hardware/firmware can result in hardware vendors shipping error analysis algorithms that do not always pinpoint a particular field replaceable unit (FRU) or the reason for a failure. For example, it may be difficult to know whether a dual inline memory module (DIMM) error is caused by bad fabrication, a bad connector, a dynamic random access memory (DRAM) failure, and/or a problem with the motherboard's traces or chipset components such as scalable memory buffers. In another example, several processor cards can plug into and make use of a common system bus, into which additional memory modules or IO modules may also be plugged. If a fault occurs in a memory controller or IO controller, the fault can appear to originate either in the controller itself or in one of the other FRUs on the common bus. Examples of FRUs include network cards, sound cards, video cards, modems, storage devices, and the like.
For such complex error analysis scenarios spanning multiple subsystems, existing fault management solutions typically generate a service event upon detecting a hardware/firmware failure event, and send it to a support engineer along with a recommendation that includes an ordered set of actions. For example, in the case of a memory failure, the recommended series of steps generally starts with reseating the DIMM, followed by replacing the DIMM (if the problem persists), and/or replacing the motherboard (if the problem still persists).
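As an illustration only, the ordered recommendation described above might be modeled as in the following minimal Python sketch. The names `RECOMMENDATIONS`, `ServiceEvent`, `create_service_event` and `next_action` are hypothetical and do not come from any particular fault management product; only the three memory-failure steps are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical lookup table: failure type -> ordered list of actions.
# Only the memory-failure entry comes from the text; the table shape
# itself is an assumption made for this sketch.
RECOMMENDATIONS: Dict[str, List[str]] = {
    "memory_failure": [
        "reseat DIMM",
        "replace DIMM",
        "replace motherboard",
    ],
}


@dataclass
class ServiceEvent:
    """A service event sent to a support engineer (assumed structure)."""
    failure_type: str
    location: str
    actions: List[str]


def create_service_event(failure_type: str, location: str) -> ServiceEvent:
    """Build a service event carrying the fixed, ordered action list."""
    # Copy the list so callers cannot mutate the shared table.
    actions = list(RECOMMENDATIONS.get(failure_type, []))
    return ServiceEvent(failure_type, location, actions)


def next_action(event: ServiceEvent, failed_attempts: int) -> Optional[str]:
    """Return the next action to try after `failed_attempts` earlier
    actions did not resolve the problem, or None if none remain."""
    if failed_attempts < 0:
        raise ValueError("failed_attempts must be non-negative")
    if failed_attempts < len(event.actions):
        return event.actions[failed_attempts]
    return None


if __name__ == "__main__":
    event = create_service_event("memory_failure", "DIMM slot 3")
    attempt = 0
    # Walk the recommendation in order, as if each step failed to help.
    while (action := next_action(event, attempt)) is not None:
        print(f"step {attempt + 1}: {action}")
        attempt += 1
```

In this sketch the action list is fixed per failure type, so the support engineer works through the same sequence regardless of which subsystem is actually at fault.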