1. Field of the Present Invention
The present invention generally relates to fault detection in computer systems and more particularly to a method and system for detecting and addressing intermittent and other fault conditions that are difficult to detect or reproduce.
2. History of Related Art
Electronic devices are susceptible to a wide variety of conditions that may result in the generation of an error code or fault condition. The complexity of sophisticated electronic devices including computer systems can make the task of identifying and addressing fault conditions extremely difficult. Computer system users who have initially encountered a condition that generates an error code, error message, or other fault condition are frequently unable to reliably reproduce the condition in the presence of a customer service engineer. When the service engineer is unable to replicate a fault condition, the engineer will assume either that the user caused the condition or that the condition is no longer affecting operation. In either event, the service engineer is unable to address the problem and both the user and the service engineer are left unsatisfied. Moreover, the service engineer will frequently have to revisit the system when the fault condition reappears. The service process described is slow and costly and causes customer dissatisfaction. Thus, it is highly desirable to provide a mechanism by which a service engineer can objectively verify that an error or fault condition has occurred. It is further desirable that the implemented solution be economical and compatible, to the extent possible, with existing systems.
The problems identified above are in large part addressed by incorporation of a fault detection mechanism into the electronic device. The fault detection mechanism is adapted to record the occurrence of a fault condition and to preserve the record until the fault condition is repaired or otherwise eliminated. Broadly speaking, the invention contemplates a device such as a tape drive or disk drive unit and a computer system that incorporates the device. The device preferably includes a controller or processor and a non-volatile storage element configured with microcode suitable for execution by the controller. In an embodiment suitable for use in the computer system, the controller is preferably configured for communicating with a peripheral bus of a computer system via a bus interface unit. The device further includes a non-volatile fault indicator and fault logic suitable for detecting a fault condition in the device. The fault logic is adapted to program the non-volatile fault indicator upon detecting a fault condition to preserve the occurrence of the fault. In this manner, both repeatable and intermittent fault conditions are documented for subsequent servicing by a service engineer.
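By way of illustration only, the behavior of the fault logic and the non-volatile fault indicator might be sketched as follows. The sketch assumes a one-byte file as a stand-in for the non-volatile element; the names NonVolatileFaultIndicator, DeviceFault, run_with_fault_logic, and power_cycle_demo are hypothetical and do not appear in the description above.

```python
import os
import tempfile


class DeviceFault(Exception):
    """Hypothetical exception standing in for a detected fault condition."""


class NonVolatileFaultIndicator:
    """Fault indicator persisted in a one-byte file (assumed stand-in for
    a non-volatile storage element)."""

    def __init__(self, path):
        self.path = path

    def program(self):
        # Record that a fault occurred; the record survives power cycles.
        with open(self.path, "wb") as f:
            f.write(b"\x01")

    def is_programmed(self):
        try:
            with open(self.path, "rb") as f:
                return f.read(1) == b"\x01"
        except FileNotFoundError:
            return False


def run_with_fault_logic(operation, indicator):
    """Run a device operation; if it faults, program the indicator.

    Returns True when the operation completes, False when a fault was
    detected and recorded.
    """
    try:
        operation()
        return True
    except DeviceFault:
        indicator.program()
        return False


def power_cycle_demo():
    """Show that the record outlives the object that wrote it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fault_indicator.bin")

        def intermittent_read():
            raise DeviceFault("read error")

        completed = run_with_fault_logic(
            intermittent_read, NonVolatileFaultIndicator(path)
        )
        # A fresh object over the same storage models a power cycle.
        after_restart = NonVolatileFaultIndicator(path)
        return completed, after_restart.is_programmed()
```

In this sketch the fault is raised only once, yet the indicator still reads as programmed after the simulated restart, which is the property that lets an intermittent fault be verified later.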
The programming of the fault indicator preferably occurs as a portion of a fault recovery routine executed by the device in response to the detection of the fault condition. In one embodiment, once the fault indicator has been programmed, it is erased, cleared, or otherwise reset only when the component of the device associated with the fault indicator has been replaced. The fault condition that triggers the programming of the fault indicator is a condition that would cause a diagnostic program appropriate for the device to indicate a failure. The fault indicator is preferably read as part of the diagnostic program and, if programmed, the fault indicator causes the diagnostic program to indicate that a failure has occurred. In one embodiment, the fault indicator comprises a portion of the non-volatile storage element such that only a single non-volatile device is required. One embodiment of the invention includes multiple additional non-volatile fault indicators, where each of the non-volatile fault indicators is associated with a corresponding component of the device.
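The interaction between the fault recovery routine, the per-component fault indicators, and the diagnostic program can be illustrated with a minimal sketch. It assumes an in-memory dictionary in place of real non-volatile storage, and the names Device, fault_recovery, replace_component, and run_diagnostic are hypothetical.

```python
class Device:
    """Hypothetical device with one fault indicator per component."""

    def __init__(self, components):
        # False means "not programmed"; a real device would keep these
        # flags in non-volatile storage rather than in memory.
        self.fault_indicators = {name: False for name in components}

    def fault_recovery(self, component):
        # Programming the indicator is one step of the recovery routine;
        # retries or resets specific to the device would follow here.
        self.fault_indicators[component] = True

    def replace_component(self, component):
        # In this embodiment the indicator is reset only on replacement.
        self.fault_indicators[component] = False


def run_diagnostic(device, self_test):
    """Return the components the diagnostic reports as failed.

    A component fails if its live self-test fails OR its fault
    indicator is programmed, so a past intermittent fault is still
    reported even when the live test passes.
    """
    failures = []
    for name, programmed in device.fault_indicators.items():
        if programmed or not self_test(name):
            failures.append(name)
    return failures
```

For example, after `fault_recovery("head")` on a device built with components `["head", "motor"]`, a diagnostic whose live self-test passes for every component would still report `"head"` until `replace_component("head")` is called.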
The invention further contemplates a method of recording the occurrence of fault conditions in which the internal logic of a device is exercised and a fault condition in the device is detected. In response to the detection of the fault condition, the occurrence of the fault condition is recorded by programming a non-volatile fault indicator of the device to preserve the occurrence of both intermittent and permanent fault conditions. The internal logic may be exercised during normal operation by a user of the device or computer system or by execution of a device diagnostic routine or program by a service technician. The diagnostic program preferably includes a step of reading the fault indicator and, if the fault indicator is programmed, indicating that a failure has occurred such that the diagnostic program will continue to indicate the failure until the fault indicator has been cleared. In an embodiment in which the fault indicator comprises a portion of the system's non-volatile storage element or boot code device, the step of setting the fault indicator is accomplished by programming one or more bits of the non-volatile storage element.
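The step of programming one or more bits of the non-volatile storage element might look like the following sketch. It assumes flash-like semantics in which an erased bit reads 1 and programming drives it to 0; the storage size, the reserved offset FAULT_BYTE_OFFSET, and the function names are hypothetical and chosen only for illustration.

```python
STORAGE_SIZE = 0x1000        # assumed size of the storage element
FAULT_BYTE_OFFSET = 0x0FFF   # assumed byte reserved for fault bits
ERASED = 0xFF                # erased flash-like cells read as all ones


def new_storage():
    """Return a simulated, fully erased non-volatile storage element."""
    return bytearray([ERASED] * STORAGE_SIZE)


def program_fault_bit(storage, bit):
    """Program one fault bit by driving it from 1 to 0."""
    storage[FAULT_BYTE_OFFSET] &= ~(1 << bit) & 0xFF


def fault_bit_programmed(storage, bit):
    """A bit counts as programmed when it reads 0."""
    return (storage[FAULT_BYTE_OFFSET] >> bit) & 1 == 0


def erase_fault_byte(storage):
    """Clear all fault bits, for example after the part is replaced."""
    storage[FAULT_BYTE_OFFSET] = ERASED
```

Because programming only moves bits from 1 to 0, repeating it for the same fault leaves the byte unchanged, and the rest of the storage element, such as microcode, is not touched. In many real flash parts an erase applies to a whole sector rather than to one byte, so a practical design would need to account for that.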