The present invention relates to data management, and more specifically, to automatic computer storage medium diagnostics.
Operating errors in storage products typically lead to downtime. For example, a storage product may be taken offline in order to troubleshoot errors or to perform error checking in general. In some cases, a system may ignore the source of an error yet slow down because that source degrades operating efficiency. Hard disks, for example, may be subject to errors that have several causes. Key sources of error include, for example, hardware (physical disk or adapter) failures, errors from an unsupported disk model or firmware level, operating system failures, or erroneous storage application code. Sometimes a disk will issue ‘alerts’ that a single sector was taken offline and that data may be written to another sector on the same disk. In other cases, the disk may have time-out values that are inappropriate for operation of the disk. Disk alerts may thus arise even when they indicate ‘normal’ activity for a disk. If a disk is identified as having failed an operation with each alert, then the storage product may become unusable or subject to constant physical maintenance by live personnel.
In conventional troubleshooting of storage products, the customer may perform a series of manual steps to diagnose which part of the storage product has the error. Sometimes the hardware is at fault, and at other times the software is the root cause of the issue. To diagnose the problem, a user typically takes a device, such as a hard disk, offline manually and runs diagnostics to determine the root cause. In some cases, multiple sources of error may be interlinked, and identification of the root cause may entail determining whether the hardware is affecting the software or vice versa. Once the device is offline, the diagnostics may be run until some resolution is achieved, and then the user must manually reactivate the device. Because of this manual intervention, low-level alerts may not be used to trigger extensive diagnostics, even when the customer has ample bandwidth to disable the device and execute diagnostics; doing so may be too disruptive to the end user. Thus, user confidence in the storage product may suffer.
Since traditional troubleshooting methods may require manual intervention, the diagnostic tests often do not perform intrusive error checks, which may require writing to the disk itself. Intrusive or destructive actions may not be authorized until the user takes the disk offline. Thus, the diagnostics may not run until after a catastrophic event has occurred, which is typically when a user decides, or is forced, to act on the disk(s).