1. Field of the Invention
The present invention relates generally to systems and tools for diagnosis and isolation of failed components in complex systems.
2. Background of the Invention
Whereas the determination of a publication, technology, or product as prior art relative to the present invention requires analysis of certain dates and events not disclosed herein, no statements made within this Background of the Invention shall constitute an admission by the Applicants of prior art unless the term “Prior Art” is specifically stated. Otherwise, all statements provided within this Background section are “other information” related to or useful for understanding the invention.
Component failures in complex systems are difficult to isolate, particularly when a catastrophic or cascading failure has occurred. When a computer system component fails, the failure may present itself to the end user in various ways, and often the perceived result of the failure does not indicate which component has actually failed.
For example, if a network adapter in a computer fails such that network communication is no longer possible, the adapter becomes an immediate candidate for testing and replacement. If, on the other hand, the network adapter fails in such a way that it causes a short circuit on the system bus and induces a system power failure, the observed symptom is not readily related back to the failed component; it may be initially diagnosed as a power supply or backplane failure. Thus, a network adapter that fails in this manner would mask the true cause of the outage. Additionally, as long as the short circuit on the network adapter exists, it would not be possible to successfully apply power and boot the system.
Using conventional troubleshooting techniques in this example scenario, it is likely that the first component replaced would be the power supply. When replacement of the power supply failed to repair the system, the next likely components to be replaced would be the system board and backplane. Additionally, in such a system-wide failure, expensive components such as system processors may become suspect.
To further exacerbate a situation such as this, the non-suspect parts would typically remain in, or be reinstalled in, the system during troubleshooting, and thus may cause further damage to existing and newly installed system components. In such cases, much effort and expense may be expended on a problem that could have been repaired quickly and inexpensively if the failing component had been properly identified upon initial failure of the system.
Therefore, there is a need in the art for a system and method to report on individual system components when a failure occurs, even in the case where power can no longer safely be applied to the system, and independently of internal system capabilities such as busses and power supplies.