A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
The storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize the information stored on volumes as a hierarchical structure of data containers, such as files and logical units. For example, each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. These data blocks are organized within a volume block number (vbn) space that is maintained by the file system. The file system may also assign each data block in the file a corresponding “file offset” or file block number (fbn). The file system typically assigns sequences of fbns on a per-file basis, whereas vbns are assigned over a larger volume address space. The file system organizes the data blocks within the vbn space as a “logical volume”; each logical volume may be, although is not necessarily, associated with its own file system.
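The fbn-to-vbn relationship described above can be illustrated with a minimal sketch. This is a hypothetical model, assuming 4 KB blocks; the names (`BlockMap`, `assign_block`, `vbn_for_offset`) are illustrative and do not come from any actual file system implementation.

```python
# Hypothetical per-file block map: fbns form a sequence private to one
# file, while vbns are drawn from the volume-wide address space.
# 4 KB block size is an assumption for illustration.
BLOCK_SIZE = 4096

class BlockMap:
    """Maps per-file block numbers (fbns) to volume block numbers (vbns)."""
    def __init__(self):
        self._fbn_to_vbn = {}

    def assign_block(self, fbn, vbn):
        # Record which volume block backs this file block.
        self._fbn_to_vbn[fbn] = vbn

    def vbn_for_offset(self, file_offset):
        # Translate a byte offset within the file to the backing vbn.
        return self._fbn_to_vbn.get(file_offset // BLOCK_SIZE)

bmap = BlockMap()
bmap.assign_block(0, 1001)   # fbn 0 -> vbn 1001
bmap.assign_block(1, 2050)   # fbn 1 -> vbn 2050 (vbns need not be contiguous)
```

Note that consecutive fbns may map to widely separated vbns, which is exactly the situation a write-anywhere layout can produce over time.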
A known type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block is retrieved (read) from disk into a memory of the storage system and “dirtied” (i.e., updated or modified) with new data, the data block is thereafter stored (written) to a new location on disk to optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. An example of a write-anywhere file system that is configured to operate on a storage system is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc., Sunnyvale, Calif.
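The copy-on-write behavior of a write-anywhere file system can be sketched as follows. This is a toy model under stated assumptions, not a description of WAFL internals; `WriteAnywhereVolume` and its methods are invented names for illustration.

```python
# Toy write-anywhere volume: a dirtied block is never overwritten in
# place. Instead it is written to a freshly allocated vbn and the
# file's block map is updated to point at the new location.
class WriteAnywhereVolume:
    def __init__(self):
        self.blocks = {}        # vbn -> data
        self.next_free_vbn = 0

    def _allocate(self):
        vbn = self.next_free_vbn
        self.next_free_vbn += 1
        return vbn

    def write_block(self, file_map, fbn, data):
        # Always write to a new on-disk location (no overwrite in place),
        # then repoint the file's fbn -> vbn mapping.
        new_vbn = self._allocate()
        self.blocks[new_vbn] = data
        old_vbn = file_map.get(fbn)
        file_map[fbn] = new_vbn
        return old_vbn, new_vbn

vol = WriteAnywhereVolume()
fmap = {}
vol.write_block(fmap, 0, b"v1")              # initial write of fbn 0
old, new = vol.write_block(fmap, 0, b"v2")   # update: block is "dirtied"
```

After the update, the old data still resides at its original vbn while the file map points at the new one, which is what permits efficient snapshot-style retention in such designs.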
The storage system may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access data containers stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the storage system by issuing file-based and block-based protocol messages (in the form of packets) to the system over the network.
A plurality of storage systems may be interconnected to provide a storage system environment configured to service many clients. Each storage system may be configured to service one or more volumes, wherein each volume stores one or more data containers. Yet often a large number of data access requests issued by the clients may be directed to a small number of data containers serviced by a particular storage system of the environment. A solution to such a problem is to distribute the volumes serviced by the particular storage system among all of the storage systems of the environment. This, in turn, distributes the data access requests, along with the processing resources needed to service such requests, among all of the storage systems, thereby reducing the individual processing load on each storage system. However, a noted disadvantage arises when only a single data container, such as a file, is heavily accessed by clients of the storage system environment. As a result, the storage system attempting to service the requests directed to that file may exceed its processing resources and become overburdened, with a concomitant degradation of speed and performance.
One technique for overcoming the disadvantages of having a single file that is heavily utilized is to stripe the file across a plurality of volumes configured as a striped volume set (SVS), where each volume, such as a data volume (DV), is serviced by a different storage system, thereby distributing the load for the single file among a plurality of storage systems. A technique for data container (such as a file) striping is described in U.S. patent application Ser. No. 11/119,278, entitled STORAGE SYSTEM ARCHITECTURE FOR STRIPING DATA CONTAINER CONTENT ACROSS VOLUMES OF A CLUSTER, now issued as U.S. Pat. No. 7,698,289 on Apr. 13, 2010 by Kazar et al., which application is hereby incorporated by reference as though fully set forth herein. According to the data container striping arrangement, each storage system may service access requests (i.e., file operations) from clients directed to the same file. File operations, such as read and write operations, are forwarded directly to the storage systems that are responsible for their portions of the data for that file.
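The placement rule behind such striping can be sketched with a simple round-robin layout. The 64 KB stripe width and the modular layout here are assumptions for illustration only; the referenced patent describes its own striping rules.

```python
# Hypothetical round-robin striping across a striped volume set (SVS):
# each fixed-size stripe of a file lands on a different data volume (DV),
# so requests for different regions go to different storage systems.
STRIPE_SIZE = 64 * 1024  # assume 64 KB stripes

def volume_for_offset(file_offset, num_volumes):
    """Return the index of the data volume serving this byte offset."""
    stripe_index = file_offset // STRIPE_SIZE
    return stripe_index % num_volumes

# With 4 data volumes, consecutive stripes rotate across DV0..DV3:
targets = [volume_for_offset(i * STRIPE_SIZE, 4) for i in range(6)]
```

Because the mapping is a pure function of the offset, any element of the cluster can compute which storage system owns a given region and forward the operation directly to it.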
An exemplary distributed multi-storage system architecture may comprise a plurality of storage systems organized as a cluster, wherein each storage system includes a thin front-end element that performs protocol conversion of file access protocols into a common cluster protocol for communicating with a back-end element of a storage system. The front-end element includes a local cache memory for temporarily storing (“caching”) data to serve client requests faster and more efficiently. Each back-end element serves one or more particular files or particular regions of files and, as such, maintains an authoritative version of the files or regions of files.
A front-end element of the cluster that receives a client request directed to a file initially attempts to serve that request from its local cache. However, the front-end element may not know whether its local cache is up-to-date because there may be another front-end element of the cluster that is also writing to that same file. Write requests are “pushed through” (forwarded) to the appropriate back-end element, whereas read requests are serviced first, where possible, from the local cache of the front-end element or, alternatively, at the appropriate back-end element. An issue with this clustered storage system architecture involves ensuring that a copy of a region of file data (i.e., a data buffer) stored in a local cache of a front-end element is up-to-date (“coherent”) with respect to the authoritative copy of that data at the back-end element.
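The request paths just described, and the coherency problem they create, can be modeled in a few lines. This is a deliberately simplified sketch assuming a single back-end holding the authoritative copy; `FrontEnd` and `BackEnd` are hypothetical names, and no locking is modeled yet.

```python
# Sketch of the front-end paths: writes are pushed through to the
# authoritative back-end copy; reads are served from the local cache
# when possible, otherwise fetched (and cached) from the back-end.
class BackEnd:
    def __init__(self):
        self.authoritative = {}   # (file, fbn) -> data

class FrontEnd:
    def __init__(self, backend):
        self.backend = backend
        self.cache = {}           # local cache of data buffers

    def write(self, file, fbn, data):
        # Push the write through to the back-end, keeping our own
        # cache in step; other front-ends' caches are NOT updated.
        self.backend.authoritative[(file, fbn)] = data
        self.cache[(file, fbn)] = data

    def read(self, file, fbn):
        # Serve from the local cache first; fall back to the back-end.
        key = (file, fbn)
        if key not in self.cache:
            self.cache[key] = self.backend.authoritative.get(key)
        return self.cache[key]

be = BackEnd()
fe1, fe2 = FrontEnd(be), FrontEnd(be)
fe1.write("f", 0, b"A")
first_read = fe2.read("f", 0)    # fe2 now caches b"A"
fe1.write("f", 0, b"B")          # fe2's cached copy is now stale
```

After the second write, fe2's cached buffer no longer matches the authoritative copy, which is precisely the incoherency that the locking schemes below are meant to prevent.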
An approach to ensuring coherency of data in a clustered multi-storage system having front-end and back-end elements involves distributed locking using file locks, such as range locks and/or opportunistic locks (op-locks). A range lock is a hard lock that provides exclusive access to a specific byte range within a file. The range lock is established upon request by a caller (such as a front-end element) and is released only at the request of the lock's owner (such as a back-end element). The front-end element can request and be granted a range lock that enables exclusive access to the corresponding range of the file, so that it can perform write operations on cached data until the back-end element instructs it to release the lock.
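A byte-range lock table of this kind can be sketched as follows, assuming exclusive-only locks on a single file. The names (`RangeLockTable`, `acquire`, `release`) are illustrative and not taken from any real locking protocol.

```python
# Minimal byte-range lock table: a range lock is granted only if the
# requested half-open range [start, end) overlaps no existing lock,
# and, being a hard lock, it is removed only by an explicit release.
class RangeLockTable:
    def __init__(self):
        self.locks = []   # list of (owner, start, end) exclusive ranges

    def acquire(self, owner, start, end):
        # Two ranges [start, end) and [s, e) overlap iff start < e and s < end.
        for _, s, e in self.locks:
            if start < e and s < end:
                return False   # conflicting range already locked
        self.locks.append((owner, start, end))
        return True

    def release(self, owner, start, end):
        # Only an explicit request removes the lock.
        self.locks.remove((owner, start, end))

table = RangeLockTable()
granted = table.acquire("fe1", 0, 4096)       # exclusive access granted
conflict = table.acquire("fe2", 1024, 2048)   # overlapping range: denied
```

While "fe1" holds the range, it can safely write its cached copy of those bytes; "fe2" is blocked until the range is released.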
An op-lock is an automatically revocable soft lock that allows the front-end element to operate on file data until such time as a conflicting operation is attempted. The front-end element can cache the data and perform read and write operations on the cached data because it knows that no other access is allowed to that data as long as it has an op-lock on the file. As soon as a second front-end element attempts a conflicting operation on the file, the back-end element blocks the conflicting operation and revokes the op-lock. In particular, the back-end element instructs the front-end element to return (“flush”) any write modifications to the back-end element and then discard the entire content of its local cache. The back-end element then unblocks the second front-end element and grants it an op-lock to the file.
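The revocation sequence above can be sketched as a toy model with a single back-end arbitrating one file: on a conflicting request, the back-end flushes the holder's dirty writes, invalidates its cache, and hands the op-lock to the new requester. Class and method names are hypothetical.

```python
# Toy op-lock model: the holder may cache writes locally; a conflicting
# open revokes the op-lock, flushing dirty data back and clearing the
# holder's cache before the lock is granted to the new front-end.
class OpLockBackEnd:
    def __init__(self):
        self.data = {}      # authoritative file data: fbn -> bytes
        self.holder = None  # current op-lock holder, if any

    def request_oplock(self, front_end):
        if self.holder is not None and self.holder is not front_end:
            self._revoke(self.holder)   # conflicting access: revoke first
        self.holder = front_end
        return True

    def _revoke(self, holder):
        # Flush the holder's modifications, then invalidate its cache.
        self.data.update(holder.dirty)
        holder.dirty.clear()
        holder.cache.clear()
        holder.has_oplock = False

class CachingFrontEnd:
    def __init__(self, backend):
        self.backend = backend
        self.cache, self.dirty = {}, {}
        self.has_oplock = backend.request_oplock(self)

    def write(self, fbn, data):
        # With an op-lock held, writes may remain in the local cache.
        self.cache[fbn] = self.dirty[fbn] = data

be = OpLockBackEnd()
fe1 = CachingFrontEnd(be)
fe1.write(0, b"dirty")
fe2 = CachingFrontEnd(be)   # conflicting open revokes fe1's op-lock
```

The cost visible even in this sketch, namely the flush, the full cache discard, and the round of revocation messages, is the overhead the following paragraph refers to.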
However, substantial overhead is required with respect to maintenance and utilization of such a distributed file system cache of file data in the clustered storage system using distributed locks. The present invention is directed to a system and method that reduces the overhead of maintaining data coherency in a clustered storage system.