1. Field of Invention
The invention generally relates to high reliability electronic data storage, and, more particularly, to an architecture for decomposing a data access request into a number of smaller tasks.
2. Description of Related Art
A file server is a computer that provides file service relating to the organization of information on writeable persistent storage devices, such as memories, tapes or disks of an array. The file server or filer may be embodied as a storage system including an operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on storage devices (e.g., disks). Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
A file server may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on the server. In this model, the client may comprise an application executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. Each client may request the services of the file system on the filer by issuing file system protocol messages (in the form of packets) to the system over the network. It should be noted, however, that the filer may alternatively be configured to operate as an assembly of storage devices that is directly attached to a (e.g., client or “host”) computer. Here, a user may request the services of the file system to access (i.e., read and/or write) data from/to the storage devices; such a request is referred to as a data access request.
A common type of file system is a “write in-place” file system, an example of which is the conventional Berkeley fast file system. In a write in-place file system, the locations of the data structures, such as data blocks, on disk are typically fixed. Changes to the data blocks are made “in-place” in accordance with the write in-place file system. If an update to a file extends the quantity of data for the file, an additional data block is allocated.
In the operation of a disk array, it is fairly common that a disk, or other storage medium, such as tape, will fail. A goal of a high performance storage system is to make the mean time to data loss (MTTDL) as long as possible, preferably much longer than the expected service life of the system. Data can be lost when one or more storage devices fail, making it impossible to recover data from the device. Typical schemes to avoid loss of data include mirroring, backup and parity protection. Mirroring stores the same data on two or more disks, so that if one disk fails, the mirror disk can be used to read data. Backup periodically copies data on one disk to another disk assuming thereby that both disks are unlikely to fail simultaneously. Parity schemes are common because they provide a redundant encoding of the data that allows for the loss of one or more disks without the loss of data while only requiring a minimal number of additional disk drives in the storage system.
Parity protection is used in computer systems to protect against loss of data on a storage device, such as a disk. A parity value may be computed by summing (usually modulo 2) data of a particular word size (usually one bit) across a number of similar disks holding different data and then storing the result on an additional similar disk. That is, parity may be computed on 1-bit wide vectors, composed of bits in predetermined positions on each of the disks. Addition and subtraction on 1-bit vectors are both equivalent to an exclusive-OR (XOR) logical operation, so the addition and subtraction operations can be replaced by XOR operations. The data is then protected against the loss of any one of the disks. If the disk storing the parity is lost, the parity can be regenerated from the data. If one of the data disks is lost, the data can be regenerated by adding the contents of the surviving data disks together and then subtracting the result from the stored parity.
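The parity computation and single-disk recovery described above can be illustrated with a minimal sketch. The function names (xor_blocks, compute_parity, reconstruct_missing) and the two-byte sample blocks are hypothetical and chosen only for illustration; the sketch assumes equal-sized blocks and a single lost block, and it is not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple per byte position across all blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity is the modulo-2 sum (XOR) of all data blocks."""
    return xor_blocks(data_blocks)


def reconstruct_missing(surviving_blocks, parity):
    """Recover one lost data block from the survivors and the parity.

    Adding the survivors and subtracting the result from the parity is,
    modulo 2, the same as XORing the survivors with the parity.
    """
    return xor_blocks(list(surviving_blocks) + [parity])


# Three hypothetical data blocks, one per data disk.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = compute_parity(data)

# Simulate the loss of the disk holding block 1 and rebuild it.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
recovered = reconstruct_missing(survivors, parity)
```

Because XOR is its own inverse, the same routine serves both to generate parity and to regenerate a lost block, which is why the addition and subtraction steps collapse into a single operation.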
Typically, the disks are divided into parity groups, each of which comprises one or more data disks and a parity disk. The disk space is divided into stripes, with each stripe containing one block from each disk. The blocks of a stripe are usually at the equivalent locations on each disk in the parity group. Within a stripe, all but one block are blocks containing data (“data blocks”) and one block is a block containing parity (“parity block”) computed by the XOR of all the data blocks in the stripe. If the parity blocks are all stored on one disk, thereby providing a single disk that contains all (and only) parity information, the system is referred to as a RAID level four implementation. If the parity blocks are contained within different disks in each stripe, usually in a rotating pattern, then the implementation is RAID level five. In addition to RAID levels four and five, one skilled in the art knows that there are several other well-known RAID levels and hybrid combinations of those RAID levels.
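The difference between the two parity placements can be sketched as follows. The helper names (parity_disk, stripe_layout) are hypothetical, and the particular rotation used for RAID level five (parity starting on the last disk and moving one disk to the left per stripe) is an assumption made for illustration; actual implementations may rotate differently.

```python
def parity_disk(stripe, num_disks, level):
    """Return the index of the disk holding the parity block of a stripe."""
    if level == 4:
        # RAID level four: one dedicated parity disk (assumed to be the last).
        return num_disks - 1
    if level == 5:
        # RAID level five: parity rotates across disks from stripe to stripe.
        # This specific rotation is an illustrative assumption.
        return (num_disks - 1 - stripe) % num_disks
    raise ValueError("only RAID levels 4 and 5 are sketched here")


def stripe_layout(stripe, num_disks, level):
    """Return a list with 'P' for the parity block and 'D' for data blocks."""
    p = parity_disk(stripe, num_disks, level)
    return ["P" if disk == p else "D" for disk in range(num_disks)]


# Layout of the first four stripes on a hypothetical four-disk parity group.
raid4 = [stripe_layout(s, 4, 4) for s in range(4)]
raid5 = [stripe_layout(s, 4, 5) for s in range(4)]
```

In the RAID level four layout every stripe places its parity on the same disk, whereas in the RAID level five layout each of the four stripes places it on a different disk.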
In a known implementation, the file system operating on top of a RAID subsystem treats the RAID disk array as a large collection of blocks wherein each block is numbered sequentially across the RAID disk array. The data blocks of a file are scattered across the data disks to fill each stripe as fully as possible, thereby placing each data block in a stripe on a different disk. Once N data blocks of a first stripe are allocated to N data disks of the RAID array, remaining data blocks are allocated on subsequent stripes in the same fashion until the entire file is written in the RAID array. Thus, a file is written across the data disks of a RAID system in stripes comprising modulo N data blocks. As stripes are filled, they are sent to the RAID subsystem to be stored.
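The allocation pattern described above can be sketched as a simple mapping from a file's block index to a stripe and a data disk. The function name place_blocks and its parameters are hypothetical; the sketch assumes the file starts at the beginning of an empty stripe and ignores parity placement.

```python
def place_blocks(num_file_blocks, n_data_disks, first_stripe=0):
    """Map each file block to a (stripe, data_disk) pair.

    Blocks fill one stripe across all N data disks before moving on to the
    next stripe, so consecutive blocks in a stripe land on different disks.
    """
    if n_data_disks < 1:
        raise ValueError("need at least one data disk")
    return [
        (first_stripe + i // n_data_disks, i % n_data_disks)
        for i in range(num_file_blocks)
    ]


# A hypothetical five-block file written across three data disks.
placement = place_blocks(5, 3)
```

Here the first three blocks fill stripe 0 on disks 0 through 2, and the remaining two blocks occupy disks 0 and 1 of stripe 1, leaving that stripe partially filled.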
In a known implementation, the RAID subsystem performs locking and I/O tasks on the stripe level, with these tasks being implemented through a collection of dedicated stripe owner threads. Each thread performs synchronous I/O on one stripe at a time, with additional I/O requests on the same stripe being queued up on that stripe owner (providing mutual exclusion). The limited number of threads used for stripe I/O and XOR operations can lead to bottlenecks, particularly during reconstruction, affecting system response time.
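The queuing behavior of such a design can be illustrated with a small sketch. Everything here is hypothetical: the class name StripeOwner, the assignment of stripes to owners by stripe number modulo the number of owners, and the use of a list append in place of real disk I/O are assumptions made for illustration, not details of any known implementation.

```python
import queue
import threading


class StripeOwner(threading.Thread):
    """A dedicated thread that serves stripe requests one at a time."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()   # pending requests for this owner
        self.completed = []          # requests in the order they were served

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:         # sentinel: no more requests
                break
            # Stand-in for synchronous I/O: the owner handles nothing else
            # until this request has finished.
            self.completed.append(item)


def run_requests(requests, num_owners=2):
    """Dispatch (stripe, operation) requests to a fixed set of stripe owners."""
    owners = [StripeOwner() for _ in range(num_owners)]
    for owner in owners:
        owner.start()
    for stripe, op in requests:
        # Assumed policy: a stripe is always served by the same owner, so
        # requests on that stripe are serialized (mutual exclusion).
        owners[stripe % num_owners].inbox.put((stripe, op))
    for owner in owners:
        owner.inbox.put(None)
    for owner in owners:
        owner.join()
    return [owner.completed for owner in owners]


served = run_requests([(0, "write-a"), (1, "read-b"), (0, "write-c"), (2, "read-d")])
```

With two owners, the request on stripe 2 is queued behind both requests on stripe 0 because the two stripes share an owner, which illustrates how a limited number of owner threads can become a bottleneck even for requests on unrelated stripes.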
In a known implementation, RAID state transitions due to disk failures and removals are sometimes not properly coordinated with the I/O path. This can result in buffers that refer to unusable disks, which can lead to errors, from which the system may not be able to recover. As stated above, I/O from the RAID stripe owners to the disk driver is synchronous. This, combined with the fact that the I/O path may handle state transitions, can lead to deadlock situations.
The resources used by these known systems (threads and memory buffers) are statically allocated during boot and a simple reservation mechanism exists to reserve buffers before performing an I/O. Such an allocation, typically accounting for worst-case error handling, results in a large allocation of resources that are never used, but nevertheless allocated and not available to other I/O threads in the system. This can reduce the system's ability to adapt to load and configuration changes.