There are extant monitoring systems for the performance of disk drives in responding to access requests, such as read and write commands. Such systems include those using input/output (I/O) error counters, SMART counters, disk throughput statistics gathering (such as the “iostat” command in Linux), and other hardware and software monitoring systems. There are many causes of low performance by failing disks, including read retries from the disk medium, mechanical issues such as wear and vibration, and corrupted storage media.
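As an illustration of the kind of raw counters such throughput-gathering tools consume, the following sketch parses one line of the Linux /proc/diskstats interface, which is the per-device counter source underlying utilities such as iostat. The sample line and the DiskStats structure are hypothetical; the field positions follow the documented /proc/diskstats layout.

```python
from dataclasses import dataclass

@dataclass
class DiskStats:
    """A subset of the per-device counters exposed by /proc/diskstats."""
    name: str
    reads_completed: int
    sectors_read: int
    ms_reading: int
    writes_completed: int
    sectors_written: int
    ms_writing: int

def parse_diskstats_line(line: str) -> DiskStats:
    # Whitespace-separated fields (0-indexed after split):
    # 2 = device name, 3 = reads completed, 5 = sectors read,
    # 6 = ms spent reading, 7 = writes completed,
    # 9 = sectors written, 10 = ms spent writing
    f = line.split()
    return DiskStats(
        name=f[2],
        reads_completed=int(f[3]),
        sectors_read=int(f[5]),
        ms_reading=int(f[6]),
        writes_completed=int(f[7]),
        sectors_written=int(f[9]),
        ms_writing=int(f[10]),
    )

# Hypothetical sample line in /proc/diskstats format:
sample = "   8       0 sda 124571 3871 9644482 68204 81203 44017 7117818 92511 0 51310 160715"
stats = parse_diskstats_line(sample)
print(stats.name, stats.reads_completed, stats.sectors_read)
```

Monitoring tools typically sample these counters periodically and difference successive readings to derive rates such as throughput and average service time.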
However, most of the extant systems are only indicative of actual or potential disk failure and do not gauge performance in view of how the disk executes its incoming access requests. A disk that exhibits performance below an “acceptable level” can thus be deemed “failed” even though it shows no other indication of failure (e.g., SMART or I/O errors) and may still be able to operate in a diminished but operative capacity.
In large storage arrays of disks that handle I/O requests from numerous other devices and processes, the performance of the disks is monitored constantly, typically with automated action that removes disks which have failed outright or which do not meet certain thresholds of acceptable performance. In typical large-scale storage systems, the determination of disk performance failure is based on aggregated disk statistics such as throughput, rate of dispatched I/O operations, await, mean service time, total service time, and other measures. Due to the unpredictable nature of disk I/O patterns, particularly in a large storage array handling many I/O requests, disk performance measurements based on these aggregated metrics must trade off sensitivity in detecting short-term I/O pauses within the array against the false alarm rate for potential disk failures within the array, i.e., the underperformance of several disks falsely indicating that one disk has likely failed.
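The sensitivity-versus-false-alarm trade-off described above can be sketched with a simple hypothetical threshold test: a disk is flagged when its mean service time over a sliding window exceeds a limit. The window length and threshold here are assumed tuning parameters, not values from any particular system; a short window catches a brief I/O pause, while a long window averages the same pause away.

```python
from collections import deque

def flag_slow_disk(service_times_ms, window=8, threshold_ms=50.0):
    """Return the sample indices at which the mean service time over the
    trailing window exceeds the threshold (the disk would be deemed failing)."""
    recent = deque(maxlen=window)
    flagged = []
    for i, t in enumerate(service_times_ms):
        recent.append(t)
        if len(recent) == window and sum(recent) / window > threshold_ms:
            flagged.append(i)
    return flagged

# A single 400 ms pause among otherwise healthy 10 ms samples:
samples = [10.0] * 10 + [400.0] + [10.0] * 10

print(flag_slow_disk(samples, window=4, threshold_ms=50.0))   # → [10, 11, 12, 13]
print(flag_slow_disk(samples, window=16, threshold_ms=50.0))  # → []
```

The short window flags the pause (and would equally flag a harmless transient, producing false alarms under bursty I/O), while the long window smooths it below threshold entirely, illustrating why aggregated metrics alone force this trade-off.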