File Servers
A file server (also termed herein “filer”) is a computer that provides file services relating to the organization of information on storage devices, such as disks. A file server or filer includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the disks. Each “on-disk” file may be implemented as a set of data structures, e.g. disk blocks, configured to store information, such as text. On the other hand, a directory may be implemented as a specially formatted file in which information about other files and directories are stored.
A filer may be configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, i.e., the filer. In this model, the client may comprise an application, such as a file system protocol, executing on a computer that “connects” to the filer over a computer network, such as point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the internet.
One type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into memory and “dirtied” with new data, the data block is stored (written) to a new location on disk to thereby optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. A particular example of a write-anywhere file system that is configured to operate on a filer is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc. of Sunnyvale, Calif. The WAFL file system is implemented within a microkernel as part of the overall protocol stack of the filer and associated disk storage. This microkernel is supplied as part of Network Appliance's Data ONTAP™ storage operating system, residing on the filer that processes file-service requests from network-attached clients.
As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a storage system that implements file system semantics and manages data access. In this sense, Data ONTAP™ software is an example of such a storage operating system implemented as a microkernel. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
Disk storage is typically implemented as one or more storage “volumes” that comprised of a set of physical storage disks, defining an overall logical arrangement of storage space. Currently available filer implementations can serve a large number of discrete volumes. Each volume is associated with its own file system and, for purposes hereof, volume and file system shall generally be used synonymously. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storage of parity information with respect to the striped data. In the example of a WAFL file system, a RAID 4 implementation is advantageously employed. This implementation specifically entails the striping of data across a group of disks, and separate storage of parity on a selected disk of the RAID group. If a single disk in a RAID group fails, then that RAID group can continue to operate in a degraded mode. The failed disk's data can be reconstructed via parity calculations as described generally above. As described herein, a RAID group typically comprises at least one data disk and one associated parity disk (or possibly data/parity partitions in a single disk) arranged according to a RAID 4, or equivalent high-reliability, implementation. However, other configurations (e.g. RAID 0, RAID 1, RAID 4, RAID 5, or RAID DP (Diagonal Parity)) are contemplated. A further discussion of RAID is found in commonly owned U.S. patent application Ser. No. 10/394,819, entitled QUERY-BASED SPARES MANAGEMENT TECHNIQUE, by Loellyn Cassell, et al., the teachings of which are expressly incorporated herein by reference.
As will be described further below, each disk is divided into a series of regions that allow data writing and access to occur on the disk in a predictable manner. These regions include generally a disk label that is used by the RAID subsystem. The on-disk label is, in essence, self-describing information for each disk that is actively attached to the storage system. The labels are used to dynamically assemble the disks into spare pools and volumes. The process of assembling disks into volumes or spare pools according to the disk labels is herein termed “disk label assimilation.” In the case that the label identifies the disk as part of a volume, the label is used to construct an in core configuration tree for that volume, starting from the disk object level up to the volume object level. Therefore, a label on a disk identifies that disk's participation in a RAID group and, furthermore, that group's association with plex, mirror and, ultimately, volume objects in the configuration tree. The label is located in a well-known location of the disk so that it can be queried by the RAID subsystem in accordance with disk label assimilation.
The storage system performs assimilation based upon disk labels and decides whether a given disk is to be placed into the general configuration of active storage, and where in the configuration it is to be placed. If a disk is deemed from its labels to be a “spare” and not part of the active storage configuration, then it is placed in a spares pool.
Other regions define the disk's table of contents, its file system area, and a core region, into which coredump information and other relevant information is stored. The disk regions are laid out in a logical and predictable manner within the disk's storage space. Certain information, like the table of contents, is located at a known offset so that the storage system can always access it when the disk is connected.
Internally, the file server or filer is a microprocessor-based computer in which one or more microprocessors are interconnected by a system bus to various system components that may be physically located on a motherboard and which include a memory, a buffer cache for storing data and commands, a network adapter for communicating over the LAN or another network, a firmware storage device such as an erasable programmable read only memory (EPROM—which may comprise a flash memory, that retains power during shutdown), that contains system firmware (including a boot mechanism), and various storage adapters for communicating with the physical disks attached to the filer.
Typically, disks are physically contained within a shelf enclosure unit, or “shelf.” A shelf is a physical enclosure that primarily provides power and connectivity to its disks.
Filer Failure and Coredump
As used herein, the storage operating system that is executing on a filer “panics” or “fails” when it detects some fatal problem which prevents it from continuing to execute. This can also be called “soft failure” as distinguished from “hard failure,” which occurs when the filer becomes more significantly disabled, for example, upon loss of electrical power or upon a hardware failure. A filer whose storage operating system panics is herein termed a “failed filer” (or “failed file server”).
In order to assist in ascertaining the cause of the fault (e.g. to “debug” the failed filer), the failed filer or other storage system typically performs a “coredump,” operation, in which it writes its current working memory (also termed, the “coredump”) contents to disk. Later, a coredump recovery process that is termed herein “savecore” reads back the coredump data and generates a “coredump file,” which it stores in the failed filer's root file system for subsequent access by an appropriate utility. The coredump file contains an image of the system memory and any non-volatile storage at the time the panic occurred. The image can be subsequently accessed and studied to assist in determining the cause of the failure event. This information assists in diagnosing the fault since it is a picture of the system at the time the failure occurred.
As noted below, time is of the essence in a panic scenario—thus, in order to expedite the complete creation of the coredump, the coredump operation typically spreads the coredump across specially allocated core regions located on multiple disks. Typically, the coredump is written in (for example) 3-MB data chunks to the designated region in a set of non-broken/operative disk currently owned by the failed filer. When the designated region on a given disk fills up, that disk is taken out of the list of available disks. The 3-MB data chunks written to disks are typically uncompressed where space permits, or can be compressed where space is at a premium—and this compressed data can be written out sequentially to disks, rather than “sprayed” across the disk set, potentially filling some disks before others. Along with coredump data, a core header is written to the designated region on each and every disk in the given disk set, and includes the number of that particular disk within the set, so that a resulting coredump file can be reassembled from the disk set at a later time.
In the case of a clustered environment, where more than one file server may be able to take control of a given disk set via ownership reservations, the coredump is only directed to owned disks of the failed filer. Because the coredump spreads the coredump over multiple disks, those disks are not otherwise accessible to the partner filer to begin the takeover process. Rather, the disks remain occupied with the actions of the failed filer in writing of the coredump.
Where a coredump procedure is to be undertaken, upon detection of a failed filer storage operating system, it is often desirable to employ a single disk to receive the coredump data, rather than distributing the data across a set of on-line disks in a “sprayed” fashion. For example, in the case of a clustered configuration where two or more filers may be interconnected to provide failover capabilities with a group of disks, the ability for the failover filer to rapidly (and in parallel with coredump) gain ownership of all but the coredump disk greatly speeds takeover. Disk ownership and the process of coredump concurrent with takeover are described in detail in the above-incorporated-by-reference U.S. patent application Ser. No. 10/764,809, entitled SYSTEM AND METHOD FOR TAKEOVER OF PARTNER RESOURCES IN CONJUNCTION WITH COREDUMP. In other standalone filer/storage system implementations, the use of a single disk to receive coredump data is often desirable as well.
In general, a good candidate as a coredump disk is one of the available spare disks typically provided in a disk set for use in volume creation, extending existing volumes, RAID reconstruction, and other disaster recovery or maintenance related operations. In general, reconstruction is an operation by which a spare disk is allocated to replace an active file system disk that has failed, parity data is used to regenerate the data that had been stored on the failed disk, and the regenerated data is written to the replacement disk. More particularly, a spare disk is one that is labeled to indicate that it is a spare and so not currently assigned any file system data storage functions. It typically stands ready for use as required by the above named operations. Typically, a spare disk undergoes formatting to prepare it for use in normal/regular file services. For example, in the exemplary filer storage operating system, a spare disk initially has a label that indicates its status as a non-formatted spare. Formatting that spare entails a “zeroing” process by which a pattern of zeroes is written the disk, followed by a label update process by which labels are written to indicate this disk's table of contents and its status as a formatted spare.
In a coredump operation, the system must decide whether an appropriate candidate for a coredump disk exists, and if so, which spare disk is the best candidate for becoming a coredump disk. After the coredump is completed, there must be a mechanism to release the coredump disk status so that the disk can be redesigned as a “hot” spare.