As computer systems scale to enterprise levels, particularly in the context of supporting large-scale data centers, the underlying data storage systems frequently adopt the use of storage area networks (SANs). As is conventionally well appreciated, SANs provide a number of technical capabilities and operational benefits, fundamentally including virtualization of data storage devices, redundancy of physical devices with transparent fault-tolerant fail-over and fail-safe controls, geographically distributed and replicated storage, and centralized oversight and storage configuration management decoupled from client-centric computer systems management.
Architecturally, a SAN storage subsystem is characteristically implemented as a large array of Small Computer System Interface (SCSI) protocol-based storage devices. One or more physical SCSI controllers operate as the externally accessible targets for data storage commands and data transfer operations. The target controllers internally support bus connections to the data storage devices, identified as logical units (LUNs). The storage array is collectively managed internally by a storage system manager to virtualize the physical data storage devices. That is, the SCSI storage devices are internally routed and respond to the virtual storage system manager as functionally the sole host initiator accessing the SCSI device array. The virtual storage system manager is thus able to aggregate the physical devices present in the storage array into one or more logical storage containers. Virtualized segments of these containers can then be allocated by the virtual storage system as externally visible and accessible LUNs with uniquely identifiable target identifiers. A SAN storage subsystem thus presents the appearance of simply constituting a set of SCSI targets hosting respective sets of LUNs. While specific storage system manager implementation details differ as between different SAN storage device manufacturers, the desired consistent result is that the externally visible SAN targets and LUNs fully implement the expected SCSI semantics necessary to respond to and complete initiated transactions against the managed container.
A SAN storage subsystem is typically accessed by a server computer system implementing a physical host bus adapter (HBA) that connects to the SAN through network connections. Within the server, above the host bus adapter, storage access abstractions are characteristically implemented through a series of software layers, beginning with a low-level SCSI driver layer and ending in an operating system specific filesystem layer. The driver layer, which enables basic access to the target ports and LUNs, is typically vendor specific to the implementation of the SAN storage subsystem. A data access layer may be implemented above the device driver to support multipath consolidation of the LUNs visible through the host bus adapter and other data access control and management functions. A logical volume manager (LVM), typically implemented intermediate between the driver and conventional operating system filesystem layers, supports volume oriented virtualization and management of the LUNs accessible through the host bus adapter. Multiple LUNs can be gathered and managed together as a volume under the control of the logical volume manager for presentation to and use by the filesystem layer as an integral LUN.
In a typical implementation, SAN systems connect with upper tiers of client and server computer systems through a communications matrix frequently implemented using a Fibre Channel (FC) based communications network. Logically, a Fibre Channel network is a bidirectional, full-duplex, point-to-point, serial data channel structured specifically for high performance data communication. Physically, the Fibre Channel is an interconnection of multiple communication ports, called N_Ports, implemented by the host bus adapters and target controllers. These communication ports are interconnected by a switching network deployed as an n-way fabric, a set of point-to-point links, or an arbitrated loop.
Strictly defined, Fibre Channel is a generalized transport mechanism that has no high-level data flow protocol of its own or native input/output command set. While a wide variety of existing Upper Level Protocols (ULPs) can be implemented on Fibre Channel, the most frequently implemented is the SCSI protocol. The SCSI Fibre Channel Protocol (FCP) standard defines a Fibre Channel mapping layer that enables transmission of SCSI command, data, and status information between a source host bus adapter, acting as a SCSI initiator, and a destination SCSI target controller, over any Fibre Channel connection path as specified by a Fibre Channel path identifier. As defined relative to a target, an FC path identifier is a reference to the destination port and logical unit of the SAN storage system. The port is uniquely specified by a World Wide Port Name (WWPN). The LUN identifier is a unique, hardware independent SCSI protocol compliant identifier value retrievable in response to a standard SCSI Inquiry command.
A common alternative transport mechanism to Fibre Channel is defined by the Internet Small Computer System Interface (iSCSI) standard. Instead of relying on a new FC media infrastructure, the iSCSI standard is designed to leverage existing TCP/IP networks, specifically the existing mixed-media infrastructure of typical intranet and internet networks, and to use the Internet Protocol (IP) layer for upper-level command and data transport. Unlike Fibre Channel, the SCSI protocol is the exclusive upper-level protocol supported by iSCSI. That is, the iSCSI protocol semantics (IETF Internet Draft draft-ietf-ips-iSCSI-08.txt; www.ietf.org) specifically require the transmission of SCSI command, data, and status information between SCSI initiators and SCSI targets over an IP network. Similar to the FC path, an iSCSI path, as specified by a SCSI initiator, is a combination of a target IP address and LUN identifier.
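The two path forms described above can be contrasted in a minimal sketch. The class names, field names, and the `describe` helper are hypothetical and chosen only for illustration; the sample WWPN, IP address, and LUN identifier strings are made-up values, and nothing here reflects an actual FCP or iSCSI wire format.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class FCPath:
    """FC path: destination port (WWPN) plus logical unit identifier."""
    wwpn: str      # World Wide Port Name of the target port
    lun_id: str    # identifier value as returned by a SCSI Inquiry command


@dataclass(frozen=True)
class ISCSIPath:
    """iSCSI path: target IP address plus logical unit identifier."""
    target_ip: str
    lun_id: str


Path = Union[FCPath, ISCSIPath]


def describe(path: Path) -> str:
    """Render a path as a short string (the format is an assumption of this sketch)."""
    if isinstance(path, FCPath):
        return f"fc:{path.wwpn}/{path.lun_id}"
    return f"iscsi:{path.target_ip}/{path.lun_id}"


if __name__ == "__main__":
    print(describe(FCPath("50:06:01:60:3b:a0:12:34", "lun-a")))
    print(describe(ISCSIPath("192.0.2.10", "lun-a")))
```

In both cases the LUN identifier is the same kind of value; only the way the target port is addressed differs between the two transports.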
As generally illustrated in FIGS. 1A and 1B, a typical system architecture 60 implements a logical volume manager 62 on a computer system 12, that is, at a system tier above the data storage systems 16, as a software layer beneath a local filesystem layer 64. By execution of the logical volume manager 62, the filesystem layer 64 is presented with a data storage view represented by one or more discrete data storage volumes 66, each of which is capable of containing a complete filesystem data structure. The specific form and format of the filesystem data structure is determined by the particular filesystem layer 64 employed. For the preferred embodiments of the present invention, physical filesystems, including the New Technology filesystem (NTFS), the Unix filesystem (UFS), the VMware Virtual Machine filesystem (VMFS), and the Linux third extended filesystem (ext3FS), may be used as the filesystem layer 64.
As is conventional for logical volume managers, each of the data storage volumes 66 is functionally constructed by the logical volume manager 62 from an administratively defined set of one or more data storage units representing LUNs. Where the LUN storage, at least relative to the logical volume manager 62, is provided by network storage systems 16, the data storage volumes 66 are assembled from an identified set of the data storage units externally presented by the network storage systems 16. That is, the logical volume manager 62 is responsible for functionally managing and distributing data transfer operations to the various data storage units of particular target data storage volumes 66. The operation of the logical volume manager 62, like the operation of a storage system manager 24, is transparent to applications 68 executed directly by computer systems 12 or by clients of computer systems 12.
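As an illustration of how a volume assembled from several data storage units might route a block address to the unit that holds it, the following minimal sketch assumes simple concatenation of the units in order. Concatenation is an assumption made here for clarity; the text above does not specify how the logical volume manager 62 distributes data, and the `ConcatenatedVolume` class and its methods are hypothetical names.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


class ConcatenatedVolume:
    """A volume built by laying data storage units (LUNs) end to end."""

    def __init__(self, units: Sequence[Tuple[str, int]]) -> None:
        # units: (lun_id, size_in_blocks) pairs, in volume order
        if not units:
            raise ValueError("a volume needs at least one data storage unit")
        self._units: List[Tuple[str, int]] = list(units)
        # Running totals: _ends[i] is the first volume block past unit i.
        self._ends: List[int] = list(accumulate(size for _, size in self._units))

    @property
    def size(self) -> int:
        """Total number of blocks presented to the filesystem layer."""
        return self._ends[-1]

    def locate(self, block: int) -> Tuple[str, int]:
        """Map a volume block number to (lun_id, block offset within that LUN)."""
        if not 0 <= block < self.size:
            raise IndexError(f"block {block} is outside the volume")
        index = bisect_right(self._ends, block)
        start = self._ends[index - 1] if index else 0
        return self._units[index][0], block - start


if __name__ == "__main__":
    volume = ConcatenatedVolume([("lun0", 100), ("lun1", 50)])
    print(volume.size, volume.locate(99), volume.locate(100))
```

The filesystem layer sees a single block range of length `size`; the mapping to individual units stays hidden inside `locate`, which mirrors the transparency described above.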
A preferred system architecture 60, implementing a virtual machine based system 70, is shown in FIG. 1C. An integral computer system 72, generally corresponding to one of the computer systems 12, is constructed on a conventional, typically server-class hardware platform 74, including in particular host bus adapters 76 in addition to conventional platform processor, memory, and other standard peripheral components (not separately shown). The server platform 74 is used to execute a virtual machine (VMKernel) operating system 78 supporting a virtual machine execution space 80 within which virtual machines (VMs) 82(1-N) are executed. For the preferred embodiments of the present invention, the virtual machine kernel 78 and virtual machines 82(1-N) are implemented using the ESX Server virtualization product manufactured and distributed by VMware, Inc., Palo Alto, Calif. Use of the ESX Server product and, further, implementation using a virtualized computer system 12 architecture, is not required in the practice of the present invention.
In summary, the virtual machine operating system 78 provides the necessary services and support to enable concurrent execution of the virtual machines 82(1-N). In turn, each virtual machine 82(1-N) implements a virtual hardware platform 84 that supports the execution of a guest operating system 86 and one or more application programs 88, typically client applications. For the preferred embodiments of the present invention, the guest operating systems 86 are instances of Microsoft Windows, Linux, and NetWare-based operating systems. Other guest operating systems can be equivalently used. In each instance, the guest operating system 86 includes a native filesystem layer, typically either an NTFS or ext3FS type filesystem layer. These filesystem layers interface with the virtual hardware platforms 84 to access, from the perspective of the guest operating systems 86, a data storage host bus adapter. In the preferred implementation, the virtual hardware platforms 84 implement virtual host bus adapters 90 that provide the appearance of the necessary system hardware support to enable execution of the guest operating system 86 transparent to the virtualization of the system hardware.
Filesystem calls initiated by the guest operating systems 86 to implement filesystem-related data transfer and control operations are processed and passed through the virtual host bus adapter 90 to adjunct virtual machine monitor (VMM) layers 92(1-N) that implement the virtual system support necessary to coordinate operation with the virtual machine kernel 78. In particular, a host bus emulator 94 functionally enables the data transfer and control operations to be ultimately passed to the host bus adapters 76. The system calls implementing the data transfer and control operations are passed to a virtual machine filesystem (VMFS) 96 for coordinated implementation with respect to the ongoing operation of all of the virtual machines 82(1-N). That is, the native filesystems of the guest operating systems 86 perform command and data transfer operations against virtual SCSI devices presenting LUNs visible to the guest operating systems 86. These virtual SCSI devices are based on emulated LUNs actually maintained as files resident within the storage space managed by the virtual machine filesystem 96. In this respect, the virtual machine filesystem 96 is to the virtual machines 82(1-N) what the storage system 16 is to the physical computer systems 12. Permitted guest operating system 86 command and data transfer operations against the emulated LUNs are mapped between the LUNs visible to the guest operating systems 86 and the data storage volumes visible to the virtual machine filesystem 96. A further mapping is, in turn, performed by a virtual machine kernel-based logical volume manager 62 to the LUNs visible to the logical volume manager 62 through the data access layers 98, including device drivers, and host bus adapters 76. The system illustrated in FIGS. 1A-C is disclosed in greater detail in commonly owned U.S. patent application Ser. No. 11/431,277, entitled “System and Methods for Automatically Re-Signaturing Multi-Unit Data Storage Volumes in Distributed Data Storage Systems”, filed 9 May 2006, the subject matter of which is incorporated herein by reference for all purposes.
Distributed locks are locks that can be used to synchronize the operations of multiple nodes within a computer system. Such nodes may be present within the same computer or distributed among different computers interconnected by a network. A lock is a mechanism utilized by a node to gain access to a system resource and to handle competing requests among multiple nodes in an orderly and efficient manner. Prior art distributed locks are most commonly implemented using a network lock manager, wherein each lock is associated with a node that is the current manager of the lock. When a particular node N wants to acquire a lock, that node must communicate with the current manager node M of the lock via an IP network. The manager node M can then grant the lock to node N or indicate that the lock is currently held by another node. Issues arise with a network lock manager when the IP network used for such communications is not working. One solution is to elect a new manager for the lock; however, this creates many complicated implementation issues.
In some systems, the most reliable network available is the storage area network (SAN), rather than the IP network. As a result, a more reliable way to implement distributed locks is to maintain the lock data structures on disk and use the SAN to access them. In order to deal with possible crashes of nodes, the distributed locks can be lease-based. That is, a node that holds a lock must renew a “lease” on the lock before the lease expires, typically by incrementing or otherwise changing a “pulse field” in the on-disk lock data structure associated with the lock, to indicate that the node still holds the lock and has not crashed. Another node can break the lock if the lease has not been renewed by the current holder for the duration of the lease. A prior application which addresses this problem is disclosed in commonly owned U.S. patent application Ser. No. 10/773,613, entitled “Providing Multiple Concurrent Access to a File System”, filed Feb. 6, 2004, the subject matter of which is incorporated herein by reference for all purposes. One problem with the lease-based scheme is that if a node holds many locks, the node must expend considerable resources simply to renew the leases on the locks it is currently holding. Shared resources like disk and network bandwidth are expended as well.
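The lease-and-pulse scheme just described can be illustrated with a minimal in-memory sketch. It is not the implementation of the referenced application: the `LockRecord`, `LeaseLock`, and `Contender` names and their fields are hypothetical, time is passed in explicitly so the behavior is deterministic, and the record lives in memory rather than on disk, so the atomic read-modify-write that a shared disk would require is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LockRecord:
    """Stand-in for the on-disk lock data structure (field names are hypothetical)."""
    owner: Optional[str] = None   # node currently holding the lock, if any
    pulse: int = 0                # changed by the holder on every lease renewal


class LeaseLock:
    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self.record = LockRecord()

    def acquire(self, node: str) -> bool:
        """Take the lock if it is free; report whether `node` now holds it."""
        if self.record.owner is None:
            self.record.owner = node
            self.record.pulse += 1
        return self.record.owner == node

    def renew(self, node: str) -> bool:
        """Renew the lease by changing the pulse field; fails if `node` lost the lock."""
        if self.record.owner != node:
            return False
        self.record.pulse += 1
        return True


class Contender:
    """A node that wants a held lock and watches the pulse field for staleness."""

    def __init__(self, node: str, lock: LeaseLock) -> None:
        self.node = node
        self.lock = lock
        self._seen: Optional[Tuple[str, int]] = None  # last (owner, pulse) observed
        self._seen_at = 0.0                           # time of that observation

    def try_break(self, now: float) -> bool:
        rec = self.lock.record
        if rec.owner is None or rec.owner == self.node:
            return self.lock.acquire(self.node)
        current = (rec.owner, rec.pulse)
        if current != self._seen:
            # The pulse changed, or this is the first look: restart the clock.
            self._seen, self._seen_at = current, now
            return False
        if now - self._seen_at < self.lock.lease_seconds:
            return False
        # No renewal for a full lease period: break the lock and take it over.
        rec.owner = self.node
        rec.pulse += 1
        return True
```

In this sketch a node holding n locks must call `renew` once per lock in every lease period, which is the per-lock renewal cost noted above.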
Some clustering software, such as RedHat Cluster Suite, commercially available from Red Hat, Raleigh, N.C., and Veritas Cluster Service, commercially available from Symantec Corporation, Cupertino, Calif., transmits a heartbeat to a “quorum” disk as an additional way to help determine whether a node is down. However, these systems do not specifically implement locks, and their primary method of detecting whether a node is alive is via an IP network, which suffers from the same network failure vulnerability described previously.
A prior attempt at utilizing on-disk heartbeats can be found in the Oracle Cluster File System, OCFS2, a clustered (distributed) file system developed by Oracle Corporation and released under the GNU General Public License, which utilizes an on-disk heartbeat to determine which members of the cluster are actually alive. However, such a system has a separate, network-based lock manager that implements distributed locks and, accordingly, suffers from the same network failure vulnerability described previously.
Accordingly, a need exists for an approach to implementing lease-based distributed locks which does not require a separate, network-based lock manager.
A further need exists for an approach to implementing lease-based distributed locks which is scalable and does not require multiple renewal processes or additional resources per renewal of each lock.
Yet a further need exists for a technique for implementing lease-based distributed locks which can accommodate a node's connection to its disk being interrupted for variable periods of time.