Advances in communication technology have allowed large numbers of machines to be aggregated into computing clusters of effectively unbounded processing power and storage capacity that can be used to solve much larger problems than a single machine could. Because clusters are composed of independent and effectively redundant computers, they have a potential for fault tolerance. This makes them suitable for other classes of problems in which reliability is paramount. As a result, there has been great interest in clustering technology in the past several years.
Cluster file systems found in the art include IBM's General Parallel File System (GPFS). GPFS is a parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters, that provides, as closely as possible, the behavior of a general-purpose POSIX file system running on a single machine.
One drawback of clusters is that programs must be partitioned to run on multiple machines. It can be difficult for these partitioned programs to cooperate or share resources. Perhaps one of the most important resources is the file system. In the absence of a cluster file system, individual components of a partitioned program share cluster storage in an ad-hoc manner. This can complicate programming, limit performance, and compromise reliability.
Some cluster file systems allow client nodes direct access to metadata, such as directories and file attributes, stored on data servers alongside the file data (distributed metadata), and use a distributed locking protocol to synchronize updates to the metadata. Other systems, such as SAN-FS, Lustre and P-NFS, use one or more dedicated metadata server nodes to handle metadata. The advantage of a cluster file system over a traditional file server is that, by distributing data over many data servers, higher aggregate data throughput can be provided.
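The throughput advantage of distributing data over many data servers can be illustrated with a minimal sketch. It assumes simple round-robin striping of fixed-size blocks, which is only one possible placement policy and is not stated in the text above; the function names and parameters are hypothetical.

```python
def locate_block(offset, block_size, num_servers):
    """Map a byte offset in a file to (server index, block index on that server).

    Assumes round-robin striping of fixed-size blocks; real cluster file
    systems may use other placement policies.
    """
    block = offset // block_size          # which logical block holds the offset
    return block % num_servers, block // num_servers


def servers_touched(offset, length, block_size, num_servers):
    """Return the set of data servers that a contiguous read would touch."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return {b % num_servers for b in range(first, last + 1)}
```

Under this assumption, a large sequential read touches every data server, so the servers can transfer their blocks concurrently rather than queuing behind a single file server.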
Traditional supercomputing applications, when run on a cluster, require parallel access from multiple nodes within a file shared across the cluster. Other applications, including scalable file and web servers and large digital libraries, are often characterized by interfile parallel access. In the latter class of applications, data in individual files is not necessarily accessed in parallel. However, since the files reside in common directories and allocate space on the same disks, file system data structures (metadata) are still accessed in parallel. In large computing systems, even administrative actions, such as adding or removing disks from a file system or rebalancing files across disks, can involve a great amount of work.
The disks of a clustered file system may be spread across some or all of the nodes that make up the cluster.
Many disk drive systems rely on standardized buses, such as the Small Computer System Interface (SCSI) bus, to connect the host computer to the controller and to connect the controller to the disk drives. SCSI is a communications protocol standard that has become increasingly popular for interconnecting computers and other I/O devices. To do so, SCSI is layered logically. This layering allows software interfaces to remain relatively unchanged while accommodating new physical interconnect schemes based upon serial interconnects such as Fibre Channel and Serial Storage Architecture (SSA). The first version of SCSI (SCSI-1) is described in ANSI X3.131-1986. The SCSI standard has undergone many revisions as drive speeds and capacities have increased. The SCSI-3 specification is designed to further improve functionality and accommodate high-speed serial transmission interfaces.
When a node failure is detected, one cannot be sure whether the node is physically down or whether the communication network has failed, making it look as if the node were down when in fact the node may very well still be active. Consequently, file system log recovery must be delayed long enough to ensure that the failed node will not be able to do any I/O from that point on, until the state of the failed node can be ascertained with certainty. As systems grow in complexity, it is increasingly undesirable to have interrupting failures at either the disk drive or the controller level. As a result, systems have become more reliable. Nevertheless, it is more than an inconvenience to the user should the disk drive system go down or off-line, even if the problem is corrected relatively quickly.
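The recovery delay described above for a suspected node failure can be sketched as a simple timing check. The sketch assumes a lease-style model in which a node stops issuing I/O once its lease lapses; that model, along with every name and parameter below, is an illustrative assumption and is not taken from the text.

```python
def earliest_safe_recovery_time(last_lease_renewal, lease_duration, max_io_latency):
    """Earliest time at which log recovery may start.

    Assumed model: the suspect node may issue I/O until its lease expires,
    and an I/O issued just before expiry may take up to max_io_latency
    to reach the disk.
    """
    return last_lease_renewal + lease_duration + max_io_latency


def may_start_log_recovery(now, last_lease_renewal, lease_duration, max_io_latency):
    """True once the suspect node can no longer have I/O in flight."""
    return now >= earliest_safe_recovery_time(
        last_lease_renewal, lease_duration, max_io_latency
    )
```

The point of the sketch is the cost of this approach: recovery is held up for the full worst-case interval even when the node is in fact down.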
High availability cluster multiprocessing may use SCSI's Reserve/Release commands to control access to disk storage devices when operating in non-concurrent mode. Shared non-concurrent access to logical volumes through multiple paths using SCSI-3 Persistent Reserve commands is described in U.S. Pat. No. 6,954,881, which is incorporated herein by reference. High availability cluster multiprocessing provides a way to fail over access to disk storage devices to another node in the event of hardware or software failures.
Persistent Reserve refers to a set of SCSI-3 standard commands and command options that provide SCSI initiators with the ability to establish, preempt, query, and reset a reservation policy with a specified target device. The functionality provided by the Persistent Reserve commands is a superset of that provided by the Reserve/Release commands. Persistent Reserve is insufficient to provide the data integrity and rapid recovery that is often required in large file systems.
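The establish, preempt, query, and reset operations can be illustrated with a toy in-memory model. This is a deliberately simplified sketch, not the SCSI-3 command set: the class and method names are hypothetical, only a single exclusive reservation type is modeled, and no real device or SCSI library is involved.

```python
class ToyPersistentReserveTarget:
    """Simplified model of a target device that tracks registrations and one reservation."""

    def __init__(self):
        self.keys = {}       # initiator name -> registered reservation key
        self.holder = None   # initiator currently holding the reservation

    def register(self, initiator, key):
        """Register a key for an initiator (a prerequisite for the other actions)."""
        self.keys[initiator] = key

    def reserve(self, initiator):
        """Establish a reservation; fails if another initiator already holds it."""
        if initiator not in self.keys or self.holder not in (None, initiator):
            return False
        self.holder = initiator
        return True

    def read_keys(self):
        """Query: list the registered keys."""
        return sorted(self.keys.values())

    def read_reservation(self):
        """Query: report the current reservation holder, if any."""
        return self.holder

    def preempt(self, initiator, victim_key):
        """Remove registrations using victim_key; take over the reservation if a victim held it."""
        if initiator not in self.keys:
            return False
        victims = [i for i, k in self.keys.items() if k == victim_key and i != initiator]
        for v in victims:
            del self.keys[v]
        if self.holder in victims:
            self.holder = initiator
        return True

    def clear(self, initiator):
        """Reset: drop the reservation and all registrations."""
        if initiator not in self.keys:
            return False
        self.keys.clear()
        self.holder = None
        return True

    def may_write(self, initiator):
        """In this toy model only the holder may write while a reservation exists."""
        return self.holder is None or self.holder == initiator
```

In this model, a surviving node that preempts the key of a node presumed failed both removes that node's registration and takes over the reservation, so later writes from the preempted node are rejected.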
What is needed is a method that quickly and efficiently prevents a node from unfencing itself and mounting a file system subsequent to a communication failure.