The present invention will be described in connection with a computer disk file subsystem for data storage. Such a storage device may be referred to as a Direct Access Storage (DAS) subsystem. Nevertheless, those skilled in the art will recognize that the invention described may be incorporated into other computer devices, and particularly into various I/O devices having built-in error detection and recovery capabilities.
The occurrence of a recovered I/O device error event may or may not indicate that a service action is required to restore normal operation. If I/O device performance is permanently degraded due to an error event, or if error events recur often enough to noticeably degrade I/O device performance, then a service action should be scheduled. However, a service action is typically not recommended if I/O performance is not noticeably degraded.
The decision whether or not a service action is required is typically made through a manual and empirical process. Detailed error symptom reports are called up and examined to determine the severity and permanence of the problem affecting machine operation. Because this process is complicated, the decision to call or not call for service action is often based on an inaccurate understanding of the real problem. An incorrect decision can be costly in terms of either lost performance or the replacement of non-defective parts.
Isolating the cause of a fault that may be highly intermittent or sensitive to usage patterns is also typically done through a manual and empirical evaluation process. The preferred prior art fault isolation technique is to use maintenance diagnostic programs to recreate an error symptom. However, this technique is inefficient at recreating intermittent error symptoms. An alternative is to manually analyze error symptom history data to derive a fault syndrome that can be equated to a probable service action. The service action then is to systematically replace suspect parts until error events are no longer reported, which is taken as an indication that normal operation has been restored.
An input/output (I/O) device 11 of the prior art is shown with its error response mechanism 13 in FIG. 1. The I/O device receives an I/O operation command to read or write data. If the I/O device successfully completes the operation, it can respond that the operation has been completed without error (response (1)). If, however, an error is detected, the I/O device proceeds with its error symptom analysis and error recovery procedure, and either recovers successfully, or fails to recover. If the operation is successfully recovered, it can transmit a command response that the operation is complete, but reporting a recovered error symptom (response (2)). If the operation is not recovered, the I/O device issues a command response that the operation is incomplete, and includes a damage report and a report of the non-recovered error symptom (response (3)).
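The three command responses described above can be illustrated with a minimal sketch. The names used here (`Status`, `CommandResponse`, `execute`, and the `operation` and `recover` callables) are hypothetical and chosen only for illustration; they are not taken from FIG. 1 or from any actual device interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    COMPLETE = 1            # response (1): completed without error
    COMPLETE_RECOVERED = 2  # response (2): completed, recovered error symptom reported
    INCOMPLETE = 3          # response (3): not completed, damage report included


@dataclass
class CommandResponse:
    status: Status
    error_symptom: Optional[str] = None
    damage_report: Optional[str] = None


def execute(operation: Callable[[], Optional[str]],
            recover: Callable[[str], bool]) -> CommandResponse:
    """Run one I/O operation and build the command response.

    `operation` returns None on success or an error symptom string
    when an error is detected (an assumed convention for this sketch).
    `recover` attempts error recovery and returns True if it succeeds.
    """
    symptom = operation()
    if symptom is None:
        return CommandResponse(Status.COMPLETE)
    if recover(symptom):
        return CommandResponse(Status.COMPLETE_RECOVERED, error_symptom=symptom)
    return CommandResponse(
        Status.INCOMPLETE,
        error_symptom=symptom,
        damage_report="operation not completed",  # placeholder damage report
    )
```

In this sketch, only the second and third responses carry error symptom data, which is the data that must then be examined outside the device.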
When the I/O device reports successful completion of the operation without error, that condition reflects normal machine operation. A report that the operation is complete, but with a recovered error symptom, indicates that I/O device operations can continue, but that the error symptom data should be examined to determine if a service action is required. A report that the operation is incomplete, with the damage report and the non-recovered error symptom, requires immediate attention to restore normal I/O device operation. A service action may or may not be required as a result of either of the error symptom reports. Any service action requirement must be determined external to the I/O device by manual analysis of error symptom data.
For example, in a DAS subsystem of a computer system, as the user performs write and read operations, transferring data to and from the subsystem, the disk file will periodically, in a known manner, generate a data check message or an equipment check message indicating that some non-routine event has occurred. Data checks are, in many disk files, somewhat expected events. They are therefore counted and recorded in the ordinary bookkeeping associated with the DAS subsystem, and may not indicate that a service action is required.
Usage data related to the DAS subsystem is also accumulated. Such data might include which I/O devices are being used, seek counts, and data transfer counts. This is sometimes referred to as usage information.
Conventionally, a report is generated periodically (such as daily or weekly) noting all of the recovered exceptions or other potential error events that occurred since the last reporting took place. A person will analyze this report to determine if the events recorded represent a problem with the I/O device, and whether corrective action is needed, and what that corrective action might be. Nevertheless, under current systems, this analysis must be performed using only the printed reports to note exception events and detect trends in error events.
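The conventional periodic report described above amounts to a tally of the events logged since the previous report. The following is a minimal sketch of such a tally, assuming a hypothetical event record with a timestamp, a device identifier, and an event kind; none of these names or fields come from an actual prior art system.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class ErrorEvent(NamedTuple):
    timestamp: float  # time the event was logged (assumed numeric)
    device_id: str    # which I/O device reported the event
    kind: str         # e.g. "data check" or "equipment check"


def periodic_report(events: Iterable[ErrorEvent],
                    last_report_time: float) -> Dict[Tuple[str, str], int]:
    """Count events per (device, kind) logged after the previous report."""
    counts: Counter = Counter()
    for event in events:
        if event.timestamp > last_report_time:
            counts[(event.device_id, event.kind)] += 1
    return dict(counts)
```

Such a tally only lists what occurred. Deciding whether the counts indicate a problem, and what corrective action to take, is still left to the person reading the report.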