File Servers
A file server (also termed herein “filer”) is a computer that provides file services relating to the organization of information on storage devices, such as disks. A file server or filer includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as text. On the other hand, a directory may be implemented as a specially formatted file in which information about other files and directories is stored.
A filer may be configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, i.e., the filer. In this model, the client may comprise an application, such as a file system protocol, executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet.
One type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into memory and “dirtied” with new data, the data block is stored (written) to a new location on disk to thereby optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. A particular example of a write-anywhere file system that is configured to operate on a filer is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc. of Sunnyvale, Calif. The WAFL file system is implemented within a microkernel as part of the overall protocol stack of the filer and associated disk storage. This microkernel is supplied as part of Network Appliance's Data ONTAP™ storage operating system, residing on the filer that processes file-service requests from network-attached clients.
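The write-anywhere behavior described above can be illustrated with a minimal sketch. It is not the WAFL implementation; the class `WriteAnywhereStore`, its append-only `disk` list and its `block_map` are hypothetical names introduced here, and free-space reclamation is deliberately ignored.

```python
class WriteAnywhereStore:
    """Toy model: a dirtied block is always written to a new location."""

    def __init__(self):
        self.disk = []        # physical blocks, never overwritten in place
        self.block_map = {}   # logical block number -> physical index

    def write(self, logical, data):
        # Append to a fresh physical location instead of overwriting.
        self.disk.append(data)
        self.block_map[logical] = len(self.disk) - 1
        return self.block_map[logical]

    def read(self, logical):
        return self.disk[self.block_map[logical]]


store = WriteAnywhereStore()
first = store.write(7, b"old")    # initial write of logical block 7
second = store.write(7, b"new")   # "dirtied" block goes to a new location
```

After the second write, the logical block resolves to the new location while the old physical block is left untouched.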
As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a storage system that implements file system semantics and manages data access. In this sense, Data ONTAP™ software is an example of such a storage operating system implemented as a microkernel. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
Disk storage is typically implemented as one or more storage “volumes” that comprise a set of physical storage disks, defining an overall logical arrangement of storage space, as well as a set of “hot” spare disks which stand ready for use as needed for file services. Currently available filer implementations can serve a large number of discrete volumes. Each volume is associated with its own file system and, for purposes hereof, volume and file system shall generally be used synonymously. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storage of parity information with respect to the striped data. A spare disk is one that is properly reserved by the owning filer, but is currently not in use for file services. It stands ready for use as needed for volume creation, extending existing volumes, RAID reconstructions and other disaster recovery or maintenance related file service operations. In general, a reconstruction is an operation by which a spare disk is allocated to replace an active file system disk in a particular RAID group that has failed, parity calculations regenerate the data that had been stored on the failed disk from surviving disks, and the regenerated data is written to the replacement disk.
In the example of a WAFL file system, a RAID 4 implementation is advantageously employed. This implementation specifically entails the striping of data across a group of disks, and separate storage of parity on a selected disk of the RAID group. If a single disk in a RAID 4 group fails, then that group can continue to operate in a degraded mode. The failed disk's data can be reconstructed from the surviving disks via parity calculations. As described herein, a RAID group typically comprises at least one data disk and one associated parity disk (or possibly data/parity partitions in a single disk) arranged according to a RAID 4, or equivalent high-reliability, implementation. However, other configurations (e.g., RAID 0, RAID 1, RAID 4, RAID 5, or RAID DP (Diagonal Parity)) are contemplated. A further discussion of RAID is found in commonly owned U.S. patent application Ser. No. 10/394,819, entitled QUERY-BASED SPARES MANAGEMENT TECHNIQUE, by Loellyn Cassell, et al., the teachings of which are expressly incorporated herein by reference.
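The parity calculation behind a RAID 4 reconstruction can be sketched as a byte-wise XOR. This is a simplified illustration that assumes single-parity XOR over equal-length blocks in one stripe; the function names `xor_blocks`, `compute_parity` and `reconstruct` are hypothetical and do not come from any storage operating system.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    # The parity block is the XOR of every data block in the stripe.
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity_block):
    # XOR of the surviving data blocks with parity yields the lost block.
    return xor_blocks(list(surviving_blocks) + [parity_block])


# One stripe across three data disks; the parity goes to a dedicated disk.
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = compute_parity(stripe)

# Suppose the second data disk fails: rebuild its block onto a spare.
rebuilt = reconstruct([stripe[0], stripe[2]], parity)
```

Because XOR is its own inverse, the same routine serves both to compute parity and to regenerate any single missing block.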
As will be described further below, each disk is divided into a series of regions that allow data writing and access to occur on the disk in a predictable manner. These regions generally include a disk label that is used by the RAID layer. The on-disk label is, in essence, self-describing information for each disk that is actively attached to the storage system. The labels are used to dynamically assemble the disks into spare pools and volumes. The process of assembling disks into spare pools and volumes, based upon the disk labels, is called “disk label assimilation.” In the case that the label identifies the disk as a part of a volume, the label is used to construct an in-core configuration tree for that volume, starting from the disk object level up to the volume object level. Therefore, a label on a disk identifies that disk's participation in a RAID group and, furthermore, that group's association with plex, mirror and, ultimately, volume objects in the configuration tree. The label is located in a well-known location of the disk so that it can be queried by the RAID subsystem in accordance with, e.g., a discovery process during a boot operation. The discovery process illustratively implements a disk event thread described herein.
The storage system performs assimilation based upon disk labels and decides whether a given disk is to be placed into the general configuration of active storage, and where in the configuration it is to be placed. If a disk is deemed from its labels to be a “spare” and not part of the active storage configuration, then it is placed in a spares pool.
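Disk label assimilation can be pictured with the following minimal sketch. The label layout is an assumption made for illustration only: `DiskLabel` and its fields are hypothetical, a label with no volume is treated as marking a spare, and the mirror level of the configuration tree is omitted for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DiskLabel:
    disk_id: str
    volume: Optional[str] = None      # None marks a spare (assumed convention)
    plex: Optional[str] = None
    raid_group: Optional[str] = None


def assimilate(labels):
    """Build a volume -> plex -> RAID group -> disks tree plus a spares pool."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    spares = []
    for label in labels:
        if label.volume is None:
            spares.append(label.disk_id)
        else:
            tree[label.volume][label.plex][label.raid_group].append(label.disk_id)
    return tree, spares


labels = [
    DiskLabel("d1", "vol0", "plex0", "rg0"),
    DiskLabel("d2", "vol0", "plex0", "rg0"),
    DiskLabel("d3"),                          # no volume: goes to the spares pool
    DiskLabel("d4", "vol1", "plex0", "rg0"),
]
tree, spares = assimilate(labels)
```

The tree is built bottom-up from what each disk reports about itself, which mirrors the self-describing role the text assigns to the label.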
Other regions define the disk's table of contents, its file system area, a coredump region, into which coredump information is stored, ownership information (described below) and other relevant information, laid out in a logical and predictable manner within the disk's storage space. Certain information, like the table of contents, is located at a known offset so that the storage system can always access it when the disk is connected.
Internally, the file server or filer is a microprocessor-based computer in which one or more microprocessors are interconnected by a system bus to various system components that may be physically located on a motherboard and which include a memory, a buffer cache for storing data and commands, a network adapter for communicating over the LAN or another network, a firmware storage device such as an erasable programmable read only memory (EPROM, which may comprise a flash memory that retains its contents during shutdown) that contains system firmware (including a boot mechanism), and various storage adapters for communicating with the physical disks attached to the filer.
Disks are typically enclosed in a shelf enclosure unit, or “shelf.” A shelf is a physical enclosure that primarily provides power and connectivity to its disks.
Filers can be organized into groups or “clusters” in which two or more filers are linked together so as to provide fault-tolerant computing in the event that one of the cluster partners panics or fails. If so, an unfailed cluster partner takes over handling of the operations of the failed partner and assumes control of its disks. This is facilitated by a number of “failover” functions (to be described further below) including a failover monitor in each filer and a cluster interconnect between filers that provides a communication pathway in the event of a panic or failure.
In a clustered environment, each filer is physically connected to all disks that are part of a given cluster and one particular filer is deemed to “own” the disks that comprise the volumes serviced by that filer. This ownership means that the filer is responsible for servicing the data contained on those disks, and that only the filer that “owns” a particular disk should be able to write data to that disk. This sole ownership helps ensure data integrity and coherency. In one exemplary file system, disk ownership information can be stored in two locations: a definitive ownership sector on each disk, and through the use of Small Computer System Interface (SCSI) level 3 reservations. These SCSI-3 reservations are described in SCSI Primary Commands -3, by Committee T10 of the National Committee for Information Technology Standards, which is incorporated fully herein by reference. This method of ownership of disks is described in detail in U.S. patent application Ser. No. 10/027,457 entitled SYSTEM AND METHOD OF IMPLEMENTING DISK OWNERSHIP IN NETWORKED STORAGE, which is hereby incorporated by reference. Other models of disk ownership are expressly contemplated and it will be understood by one with knowledge in the area of network storage that the disclosed invention is not limited to the methods of ownership as described above. For example, a topology-based ownership scheme can be employed. This includes a traditional A/B cluster ownership scheme in which the filer connected to the A Fibre Channel port of a given disk shelf is deemed to be the default owner of that shelf, and all of the disks it contains, while the filer connected to the B port is the takeover cluster partner. Similarly, another topology-based scheme can be employed in which disk ownership is determined in part by the switch port to which a disk is connected. This exemplary scheme defines ownership based upon the switch port bank (e.g., a group of distinct ports) into which a disk's A port is connected.
For example, using a commercially available Brocade Communications Systems, Inc. (of San Jose, Calif.) 3800 series switch, having 16 ports divided into Bank 1 (ports 0-7) and Bank 2 (ports 8-15), a filer connected to Bank 1 is deemed to own disks connected to Bank 2 so as to further ensure data redundancy. This is described in detail in The FAS900 Series Appliance Cluster Guide (part #210-00342), published by Network Appliance, Inc., May 2003 (see generally Chapter 3).
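The switch-bank ownership rule in this example can be sketched as follows. The port ranges come from the text; the function names, the `filer_ports` mapping and the symmetric rule that a filer on Bank 2 owns disks on Bank 1 are assumptions made for illustration.

```python
def bank_of(port):
    """Map a port on the 16-port switch to its bank: 0-7 -> 1, 8-15 -> 2."""
    if not 0 <= port <= 15:
        raise ValueError(f"port {port} is outside the 16-port switch")
    return 1 if port <= 7 else 2


def owner_of_disk(disk_a_port, filer_ports):
    """Return the filer attached to the bank opposite the disk's A port.

    filer_ports maps a (hypothetical) filer name to its switch port.
    """
    disk_bank = bank_of(disk_a_port)
    for filer, port in filer_ports.items():
        if bank_of(port) != disk_bank:
            return filer
    return None  # no filer is attached to the opposite bank


filers = {"filer_a": 0, "filer_b": 8}          # Bank 1 and Bank 2 respectively
owner_bank2_disk = owner_of_disk(10, filers)   # disk A port in Bank 2
owner_bank1_disk = owner_of_disk(3, filers)    # disk A port in Bank 1
```

Placing a filer and the disks it owns on opposite banks means the loss of one bank does not remove both the owner and its disks from the fabric.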
Filer Failure and Takeover
As used herein, a filer in a cluster configuration “panics” or “fails” when it detects some fatal problem which prevents it from continuing to execute normally, but is nonetheless able to communicate with other nodes in the cluster, including its cluster partner. Thus, the touchstone of such failure is the continued ability to communicate in the cluster despite loss of some functionality or performance. This can also be called “soft failure” as distinguished from “hard failure,” which occurs when the filer becomes unable to communicate with other nodes in the cluster, for example, upon loss of electrical power. Hence, a filer in which the storage operating system panics is generally termed a “failed filer” (or a “failed file server”).
When a filer fails in a clustered environment, the need arises to transfer the ownership of a volume from the failed filer to another partner filer in order to provide clients with continuous access to the disks. One method of “takeover” or “failover” is described in detail in U.S. patent application Ser. No. 09/933,883 entitled NEGOTIATED GRACEFUL TAKEOVER IN A NODE CLUSTER.
In order to assist in ascertaining the cause of the fault (e.g., to “debug” the failed filer), the failed filer or other storage system typically performs a “coredump” operation, in which it writes its current working memory contents (also termed the “coredump”) to disk. Later, a coredump recovery process called “savecore” reads back the coredump data and generates a “coredump file,” which it stores in the failed filer's root file system. The coredump file contains an image of the system memory and any non-volatile storage at the time the panic occurred. The image can be subsequently accessed and studied to assist in determining the cause of the failure event. This information assists in diagnosing the fault since it is a picture of the system at the time the failure occurred.
As noted below, time is of the essence in a panic scenario. Thus, in order to expedite the complete creation of the coredump, the coredump operation typically spreads the coredump across specially allocated core regions located on multiple disks. Typically, the coredump is written in (for example) 3-MB data chunks to the designated region in a set of non-broken/operative disks currently owned by the failed filer. When the designated region on a given disk fills up, that disk is taken out of the list of available disks. The 3-MB data chunks written to disks are typically uncompressed where space permits, or can be compressed where space is at a premium. Compressed data can be written out sequentially to disks, rather than “sprayed” across the disk set, potentially filling some disks before others. Disks are numbered so that a resulting coredump file can be reassembled from the disk set at a later time.
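The chunked spreading of a coredump over several core regions can be sketched as below. This is an illustrative model rather than the actual coredump routine: the round-robin order, the per-chunk sequence numbers used for reassembly, and the names `spray` and `reassemble` are assumptions; only the 3-MB chunk size and the removal of full disks from the available list are taken from the text.

```python
CHUNK = 3 * 1024 * 1024  # 3-MB chunks, as in the example above


def spray(dump, regions, chunk=CHUNK):
    """Spread `dump` across core regions.

    regions maps a disk name to the capacity (bytes) of its core region.
    Returns a list of (sequence_number, disk, data) placements.
    """
    free = dict(regions)
    available = list(regions)
    placements = []
    for seq, offset in enumerate(range(0, len(dump), chunk)):
        piece = dump[offset:offset + chunk]
        # A disk whose region can no longer hold a chunk leaves the list.
        available = [d for d in available if free[d] >= len(piece)]
        if not available:
            raise RuntimeError("core regions exhausted")
        disk = available[seq % len(available)]  # assumed round-robin choice
        free[disk] -= len(piece)
        placements.append((seq, disk, piece))
    return placements


def reassemble(placements):
    """Rebuild the dump by ordering chunks on their sequence numbers."""
    return b"".join(piece for _, _, piece in sorted(placements, key=lambda p: p[0]))


# Tiny demonstration with 2-byte chunks so the behavior is easy to follow.
dump = bytes(range(10))
placements = spray(dump, {"d0": 4, "d1": 8}, chunk=2)
disks_used = [disk for _, disk, _ in placements]
```

In this run the smaller region on `d0` fills after two chunks, after which every remaining chunk lands on `d1`.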
In the case of a clustered environment, where more than one file server may be able to take control of a given disk set via ownership reservations, the coredump is only directed to owned disks of the failed filer. Because the coredump operation spreads the coredump over multiple disks, those disks are not otherwise accessible to the partner filer to begin the takeover process. Rather, the disks remain occupied with the actions of the failed filer in writing the coredump. As the coredump disks must, typically, be accessed by the partner filer as part of a conventional takeover operation, the partner filer consequently delays the overall takeover process until the failed filer completes its coredump. In effect, the takeover process proceeds through two sequential steps: first, the coredump by the failed filer is completed; then, takeover by the partner filer occurs. While the two steps (coredump and takeover) proceed, the failure may actually turn from “soft” to “hard,” with the failed filer becoming completely inaccessible before takeover is fully completed. In addition, during this delay, data handled by the failed filer is inaccessible to clients, and is not made available again until takeover is complete. It is highly desirable to reduce unavailability of data from a cluster to the greatest extent possible, particularly in a block-based (SAN) environment in which clients are highly vulnerable to data unavailability. For example, if a file server does not respond within a set period of time, the SAN protocol may issue a network-wide panic, which may, in turn, lead to a total network shutdown. Thus, to avoid undesirable (and potentially crippling) downtime, the overall takeover operation, including coredump, should be performed as quickly as possible.