1. Field of the Invention
This invention relates to networked data storage systems, and more particularly, to reliability analysis of failures in disk drives used in such systems.
2. Background Information
A storage system is a computer that provides storage service relating to the organization of information on writable persistent storage devices, such as memories, tapes or disks. The storage system is commonly deployed within a storage area network (SAN) or a network attached storage (NAS) environment. When used within a NAS environment, the storage system may be embodied as a file server including an operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
In the client/server model, the client may comprise an application executing on a computer that “connects” to the file server (or “filer”) over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. NAS systems generally utilize file-based access protocols; therefore, each client may request the services of the filer by issuing file system protocol messages (in the form of packets) to the file system over the network. By supporting a plurality of file system protocols, such as the conventional Common Internet File System (CIFS), the Network File System (NFS) and the Direct Access File System (DAFS) protocols, the utility of the filer may be enhanced for networking clients.
A SAN is a high-speed network that enables establishment of direct connections between a storage system and its storage devices. The SAN may thus be viewed as an extension to a storage bus and, as such, an operating system of the storage system (a storage operating system, as hereinafter defined) enables access to stored information using block-based access protocols over the “extended bus.” In this context, the extended bus is typically embodied as Fibre Channel (FC) or Ethernet media (i.e., network) adapted to operate with block access protocols, such as Small Computer Systems Interface (SCSI) protocol encapsulation over FC or TCP/IP/Ethernet.
A SAN arrangement or deployment allows decoupling of storage from the storage system, such as an application server, and placing of that storage on a network. However, the SAN storage system typically manages specifically assigned storage resources. Although storage can be grouped (or pooled) into zones (e.g., through conventional logical unit number or “lun” zoning, masking and management techniques), the storage devices are still pre-assigned by a user (e.g., a storage system administrator, as defined hereinafter) to the storage system.
Thus, the file server, as used herein, may operate in any type of storage system configuration including a NAS arrangement, a SAN arrangement, or a hybrid storage system that incorporates both NAS and SAN aspects of storage.
A file server's access to disks is governed by an associated “storage operating system,” which generally refers to the computer-executable code operable on a storage system that manages data access, and in the case of filers, implements file system semantics. In this sense, the NetApp® Data ONTAP™ operating system available from Network Appliance, Inc., of Sunnyvale, Calif. that implements the Write Anywhere File Layout (WAFL™) file system is an example of such a storage operating system implemented as a microkernel. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
The storage devices in a file server environment are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. These include hard disk drives (HDD), direct access storage devices (DASD) or logical unit number (lun) storage devices. Disk storage is typically implemented as one or more storage “volumes” that comprise physical storage disks, defining an overall logical arrangement of storage space. As will be understood by those skilled in the art, the rotating magnetic media storage devices contain one or more disk platters that are accessed for read/write operations by a magnetic read/write head that is operated electromechanically by a mechanism that includes hardware such as motors, bearings and the like, as well as firmware and software, which provide supervisory instructions to the hardware mechanisms. This assembly as a whole is referred to herein as a “disk drive.”
In a typical file server implementation, a plurality (e.g., hundreds) of individual disk drives are arrayed in a field installation to provide storage organized as a set of volumes or similar multi-drive arrangements. The disk drives are manufactured and shipped for use in NAS, SAN or hybrid environments. These storage environments incorporate multiple disk drives from the world's leading manufacturers into storage systems that are either deployed in a central location or geographically dispersed. The entity that designs and implements the storage system is referred to herein as the “storage network provider.” The customer who purchases the storage system from the storage network provider, and makes storage services available via a client/server model or otherwise through the storage system, whether it is in the NAS or SAN configuration, is referred to herein for simplicity as a “user.” An individual entity that makes requests via the NAS or SAN in order to access data contained therein is referred to herein as a “client.” Either the storage network provider, or the user, or both of these entities may from time to time provide overall supervision and maintenance (e.g., software updates, etc.) and may provide reliability analysis and other controls, as discussed herein, with respect to the storage system. The person or entity providing configuration, supervision, maintenance and/or reliability assistance for a storage system is referred to herein as a “storage system administrator.”
As noted, the storage network providers order disk drives for a field installation from third party manufacturers. Thus, a field installation can contain disk drives from several different disk drive manufacturers. Moreover, disk drive manufacturers often create what are known as drive “families.” Within a drive family, the drives are nearly identical, except for the number of disks and read/write heads. Drive families are used to maximize commonality between products thereby reducing design and manufacturing costs, while addressing the market's need for multiple capacity points such as 18 gigabytes (GB), 36 GB and 72 GB. Each family of drives typically goes through its own unique design and manufacturing process.
As also noted, each SAN or file server implementation incorporates a large field installation base including hundreds of disk drives for each drive family from each manufacturer. Given the large number of disk drives in a typical implementation, there is a reasonable likelihood that one or more disk drives will experience an operational problem that either degrades drive read-write performance or causes a drive failure. This is because disk drives are complex electromechanical systems. Sophisticated firmware and software are required for the drive to operate with other components in the storage system. The drives further incorporate moving parts and magnetic heads, which are sensitive to particulate contamination and electrostatic discharge (ESD). There can be defects in the media, rotational vibration effects, and failures relating to the motors, bearings and other hardware components or connections. Some problems arise with respect to drive firmware or drive circuitry. Environmental factors such as temperature and altitude can also affect the performance of the disk drive.
Thus, drives can fail, and a failure can be significant if it results in nonperformance of the drive. Therefore, it is important for a storage system administrator to understand the mechanisms by which disk drive errors occur, especially those errors that could ultimately result in a drive failure. To that end, error information such as error codes may be useful in determining whether there are any conclusions to be drawn about a particular drive family from a particular drive manufacturer so that the manufacturer can be notified of performance issues that arise in the field.
However, even though most disk drives incorporate an error code reporting capability, simply recording error codes does not provide enough information to fully evaluate the underlying reason for the error code having been generated. Error codes are typically reported by an individual drive, and global studies of drive families by SCSI error codes have not been available.
In the past, reliability analysis has been confined to predicting the time at which a predetermined percentage of components can be expected to have failed. This utilizes field failure data and combines such data to predict the probability of failure in a particular device over time. These studies are typically top level analyses which statistically predict a failure rate of a particular type of drive or a drive family, and have not been directed to specific underlying symptoms or causes. More specifically, error codes usually identify a physical symptom of an error, such as a loose connection or an open circuit. This symptom is known as a “failure mode.” The underlying cause of the symptom, i.e., the physics that results in the failure, may be a phenomenon such as “corrosion” (in that corrosion can lead to an incomplete or loose wire connection, etc.). This is known as a “failure mechanism.” Studies have not been available which evaluate failure modes and failure mechanisms and how these may change during the operating life of a disk drive, or in a family of disk drives.
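By way of illustration only, the conventional prediction of the time at which a predetermined percentage of components can be expected to have failed may be sketched as follows. No particular statistical model is specified above; the sketch assumes a two-parameter Weibull distribution, which is commonly used in reliability work, and the function names and the parameters `eta` (characteristic life) and `beta` (shape) are hypothetical choices made for the example.

```python
import math


def weibull_cdf(t, eta, beta):
    """Fraction of units expected to have failed by time t.

    Assumes a two-parameter Weibull model (an assumption of this sketch):
    F(t) = 1 - exp(-(t / eta) ** beta)
    """
    if t < 0 or eta <= 0 or beta <= 0:
        raise ValueError("t must be >= 0; eta and beta must be > 0")
    return 1.0 - math.exp(-((t / eta) ** beta))


def time_to_fraction_failed(p, eta, beta):
    """Time at which a fraction p of units is expected to have failed.

    Obtained by solving F(t) = p for t under the same Weibull assumption:
    t = eta * (-ln(1 - p)) ** (1 / beta)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if eta <= 0 or beta <= 0:
        raise ValueError("eta and beta must be > 0")
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)
```

Such a calculation yields a single top level figure for a drive type or drive family. It says nothing about which failure modes or failure mechanisms produced the failures, which is the limitation noted above.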
Studies have been performed which compute either an annual or an annualized failure rate, which is a rolling three-month average that assumes a constant failure rate for all drives. This type of study further assumes that the probability of failure is equally likely in any fixed period of time. This, however, is rarely true, as set forth in IDEMA Standards, “Specification of Hard Disk Drive Reliability,” document number R2-98. This procedure does not account for the changing nature of disk drive failure rates or for the fact that some failure mechanisms contribute to failure rates at one point in the life of a drive, whereas other failure mechanisms dominate at subsequent points in the operation of a drive.
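By way of illustration only, an annualized failure rate computed as a rolling three-month average may be sketched as follows. The exact formula used in such studies is not given above; the sketch assumes that the failures observed in the trailing window are divided by the drive-months of exposure in that window and scaled by twelve. The function name `rolling_afr` and its inputs are hypothetical.

```python
def rolling_afr(monthly_failures, monthly_drives_in_service, window=3):
    """Annualized failure rate for each month that has a full trailing window.

    Assumed formula (not taken from any standard):
        AFR = 12 * (failures in window) / (drive-months in window)

    Returns a list of fractions per year, one per month starting at the
    first month for which `window` months of data are available.
    """
    if len(monthly_failures) != len(monthly_drives_in_service):
        raise ValueError("input sequences must have the same length")
    if window < 1:
        raise ValueError("window must be at least 1")

    rates = []
    for end in range(window, len(monthly_failures) + 1):
        failures = sum(monthly_failures[end - window:end])
        drive_months = sum(monthly_drives_in_service[end - window:end])
        if drive_months <= 0:
            raise ValueError("no drives in service during the window")
        rates.append(12.0 * failures / drive_months)
    return rates
```

For example, with 1,000 drives in service each month and monthly failure counts of 1, 2, 3 and 6, the two complete windows contain 6 and 11 failures over 3,000 drive-months, giving rates of 0.024 and 0.044 per year. Each figure treats failure as equally likely throughout its window, even though the monthly counts are rising, and does not indicate which failure mechanisms are responsible.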
In addition, there is typically no mapping of error codes to specific failure mechanisms for specific drive families. Thus, the typical top level analysis does not provide an accurate picture of the overall rates of change of failure mechanisms over time for drive families.
There remains a need, therefore, for an improved method of performing reliability analysis of disk drive failures that can take into account how different failure modes and/or failure mechanisms can contribute to an overall failure rate over the course of the life of disk drives. There remains a further need for a method of reliability analysis which can be focused on a particular drive family from particular third party manufacturers so that failure mechanisms that frequently occur with respect to that drive family can be identified and addressed.