Data are being created at an astonishing rate, and the utility of manipulating, correlating, and searching these data has given rise to a related problem: how to store the information so that large quantities of data can be saved reliably and accessed quickly. One approach that has proven effective is to provide a storage server: a special-purpose processing system used to store and retrieve data on behalf of one or more client processing systems (“clients”).
A file server is an example of a storage server. A file server operates on behalf of one or more clients to store and manage shared files. It typically presents a uniform interface to data that are actually stored on one or more mass storage devices (such as magnetic or optical disks or tapes). The file server provides a layer of abstraction so that its clients can ignore the individual mass storage devices and treat the server as a single large (and often expandable) disk.
In a large-scale storage system, it is inevitable that one or more individual mass storage devices will experience operational anomalies from time to time. For example, although a hard disk may have a typical access time of a few milliseconds, the system may detect that a read or write operation on the disk did not complete until a few seconds or tens of seconds after the operation was issued. Alternatively, a mass storage device may provide a status indicator to inform the storage system about a recently completed I/O operation, and the status may show that, although the operation completed correctly, a retry was required or an error-correcting function was invoked. These anomalies may be true transient events, triggered by a chance sequence of timings and operations, or they may provide an important indication of impending device failure. Since it may not be economical to replace a device before it is known to be faulty, the ability to distinguish transient from predictive anomalies may be desirable.
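The two kinds of anomaly described above (an operation that completes far later than expected, and an operation that completes correctly but reports a retry or error correction) can be illustrated with a minimal Python sketch. The record type `IoCompletion`, the function `anomaly_reasons`, and the one-second threshold are all hypothetical and chosen only for illustration; they are not taken from any particular storage system. The sketch only detects anomalies and makes no attempt to separate transient from predictive ones.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold: typical access takes a few milliseconds, so a
# completion time of a second or more is treated here as anomalous.
SLOW_IO_THRESHOLD_S = 1.0


@dataclass
class IoCompletion:
    """Hypothetical record of one completed I/O operation."""
    device_id: str
    latency_s: float           # time from issue to completion, in seconds
    succeeded: bool            # True if the operation completed correctly
    retried: bool = False      # device status reported that a retry was required
    ecc_invoked: bool = False  # device status reported error correction


def anomaly_reasons(io: IoCompletion,
                    slow_threshold_s: float = SLOW_IO_THRESHOLD_S) -> List[str]:
    """Return the reasons (possibly none) why this completion looks anomalous."""
    reasons = []
    if io.latency_s >= slow_threshold_s:
        reasons.append("slow")
    # A correct completion that still needed a retry or error correction
    # is the second kind of anomaly described in the text.
    if io.succeeded and io.retried:
        reasons.append("retry")
    if io.succeeded and io.ecc_invoked:
        reasons.append("ecc")
    return reasons


normal = IoCompletion("disk0", latency_s=0.005, succeeded=True)
delayed = IoCompletion("disk0", latency_s=12.0, succeeded=True, retried=True)
print(anomaly_reasons(normal))   # []
print(anomaly_reasons(delayed))  # ['slow', 'retry']
```

A system might accumulate such reasons per device over time; how that history would be used to judge whether a device is likely to fail is outside the scope of this sketch.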