A file server is a computer that provides file service relating to the organization of information on storage devices, such as disks. The file server or filer includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
A filer may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, e.g., the filer. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the file system on the filer by issuing file system protocol messages (in the form of packets) to the filer over the network.
A file server's access to disks is governed by an associated “storage operating system,” which generally refers to the computer-executable code operable on a storage system that manages data access, and in the case of filers, implements file system semantics. In this sense, ONTAP software is an example of such a storage operating system implemented as a microkernel. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
The storage devices in a file server environment are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with a hard disk drive (HDD), a direct access storage device (DASD) or a logical unit number (LUN) in a storage device. Disk storage is typically implemented as one or more storage “volumes” that comprise physical storage disks, defining an overall logical arrangement of storage space.
In a typical file server or storage area network (SAN) implementation, hundreds of individual disk drives are arrayed to provide storage organized as a set of volumes or similar multi-drive arrangements. Given the large number of disks in a typical implementation, there is a reasonable likelihood that one or more disk drives will experience an operational problem that either degrades drive read-write performance or causes a drive failure. Some problems relate to drive firmware or hardware, including magnetic media, spin motor, read/write head assembly or drive circuitry. Such firmware and hardware problems generally dictate that the disk drive be returned to the original manufacturer for repair or replacement. Other potential problems are user-related, and often result from software problems within the storage operating system or user applications.
A typical user may not be able to differentiate between a disk drive experiencing more-serious firmware/hardware faults and one experiencing less-serious software problems. Rather, the user/administrator often performs only a basic diagnostic of the drive (if possible), and submits the required warranty claim with a brief explanation of the problem (usually in the form of a return merchandise authorization (RMA)) to the vendor's customer service. The explanation may, or may not, accurately describe the problem. Some drives may utilize proprietary methods to record mechanical failure information in internal logs (e.g., SMART data). However, this information is typically available only to disk drive vendors, and it does not allow operating systems such as Data ONTAP to provide input on why a disk may have been failed.
As large volumes of potentially faulty disk drives are returned, the vendor's customer service department must determine whether each drive is truly faulty. In order to correctly determine the type and degree of problem, each returned disk drive is subjected to a series of failure analysis tests and procedures (read/write test, zeroing of all media locations, etc.) on an appropriate test bed. If the drive passes all tests, it is either returned to the original user or zeroed and placed back into stock for reuse by other customers as so-called refurbished goods. If it fails a test, then it is usually forwarded to the original manufacturer or another facility for repairs and/or credit.
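By way of illustration only, the pass/fail disposition described above can be sketched as follows. This is a minimal sketch and not a vendor's actual procedure: the names `Disposition`, `CheckResult` and `triage` are hypothetical, and each failure analysis test is assumed to be a callable that takes a drive identifier and reports pass or fail.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Disposition(Enum):
    """Possible outcomes for a returned drive (hypothetical labels)."""
    RETURN_OR_RESTOCK = "return to user, or zero and restock as refurbished"
    FORWARD_FOR_REPAIR = "forward to manufacturer or other facility"


@dataclass
class CheckResult:
    """Outcome of one failure analysis test (hypothetical structure)."""
    name: str
    passed: bool


def triage(
    drive_id: str,
    tests: Iterable[Callable[[str], CheckResult]],
) -> Tuple[Disposition, List[CheckResult]]:
    """Run each test in order and stop at the first failure.

    A drive that passes every test is eligible for return or restocking;
    a drive that fails any test is forwarded for repair and/or credit.
    """
    results: List[CheckResult] = []
    for test in tests:
        result = test(drive_id)
        results.append(result)
        if not result.passed:
            return Disposition.FORWARD_FOR_REPAIR, results
    return Disposition.RETURN_OR_RESTOCK, results
```

Note that a sequence of this kind only reports what the tests observe at the time they are run on the test bed; a fault that does not manifest during testing yields the same disposition as a healthy drive.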
Some faults may elude customer service's diagnostic process if they are hard-to-spot or intermittent in nature. Other faults may linger even after a repair is completed. As such, customers may experience the same fault or problem in a recycled disk drive again and again. It is desirable to provide a readable and writeable storage area within a disk drive that allows error or fault information to be appended to the drive for both diagnostic and historical purposes. This would aid in correctly diagnosing the fault and determining whether a recurring, potentially irreparable fault exists. However, appending this information to the storage media (i.e., the magnetic disk platters) is not necessarily an effective approach, since the media is often one of the more failure-prone elements in a disk drive, and is susceptible to erasure under certain conditions. Rather, a more robust nonvolatile storage location is desired for storing diagnostic and fault information.
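To make the notion of an appended fault history concrete, the following is a minimal sketch under stated assumptions. The names `FaultEntry` and `FaultLog`, the entry fields, and the fault codes are all hypothetical, and an in-memory list stands in for the nonvolatile storage location; no particular drive interface or log format is implied.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FaultEntry:
    """One appended fault record (hypothetical fields)."""
    timestamp: int   # e.g., seconds since some epoch
    code: str        # short fault identifier
    detail: str      # free-form diagnostic text


@dataclass
class FaultLog:
    """Append-only fault history; the list stands in for nonvolatile storage."""
    entries: List[FaultEntry] = field(default_factory=list)

    def append(self, entry: FaultEntry) -> None:
        # Entries are only ever added, never rewritten, so history is preserved
        # across repair and reuse cycles.
        self.entries.append(entry)

    def recurring(self, min_count: int = 2) -> List[str]:
        """Return fault codes seen at least min_count times, sorted by name."""
        counts = Counter(e.code for e in self.entries)
        return sorted(code for code, n in counts.items() if n >= min_count)
```

With a history of this kind, a drive that is returned a second time with the same fault code can be flagged as a candidate recurring fault rather than being retested from scratch.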