The present invention relates to a method and apparatus for facilitating hardware fault management in a computer system, for example a computer server system.
In a conventional computer system, when a hardware error occurs, one or more server domains will crash. Typically, this can result in one or more error messages being printed on a console, and eventually the affected domain(s) will reboot. During this reboot, a Power On Self Test (POST) utility or the like may catch the faulty component(s) and deconfigure that component or components from the system. This approach has several disadvantages. When the system restarts, POST has to recreate the error during its diagnostic phase, diagnose the faulty component and deconfigure it from the system. Users, however, often set the diagnostic level of POST to a low value, thus restricting POST's diagnostic capabilities. POST may also fail to detect the error, which might have been triggered by some specific sequence of events. Consequently, even after a domain reboots, there is no guarantee that the fault has been isolated and will not recur. In addition, information regarding the crash is typically not saved.
Accordingly, there is a need for an improved method and apparatus for facilitating hardware fault management in a computer system, for example a computer server system.